new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Apr 13

Conservative Offline Robot Policy Learning via Posterior-Transition Reweighting

Offline post-training adapts a pretrained robot policy to a target dataset by supervised regression on recorded actions. In practice, robot datasets are heterogeneous: they mix embodiments, camera setups, and demonstrations of varying quality, so many trajectories reflect recovery behavior, inconsistent operator skill, or weakly informative supervision. Uniform post-training gives equal credit to all samples and can therefore average over conflicting or low-attribution data. We propose Posterior-Transition Reweighting (PTR), a reward-free and conservative post-training method that decides how much each training sample should influence the supervised update. For each sample, PTR encodes the observed post-action consequence as a latent target, inserts it into a candidate pool of mismatched targets, and uses a separate transition scorer to estimate a softmax identification posterior over target indices. The posterior-to-uniform ratio defines the PTR score, which is converted into a clipped-and-mixed weight and applied to the original action objective through self-normalized weighted regression. This construction requires no tractable policy likelihood and is compatible with both diffusion and flow-matching action heads. Rather than uniformly trusting all recorded supervision, PTR reallocates credit according to how attributable each sample's post-action consequence is under the current representation, improving conservative offline adaptation to heterogeneous robot data.

BeingBeyond BeingBeyond
·
Mar 17 2

CDSA: Conservative Denoising Score-based Algorithm for Offline Reinforcement Learning

Distribution shift is a major obstacle in offline reinforcement learning, which necessitates minimizing the discrepancy between the learned policy and the behavior policy to avoid overestimating rare or unseen actions. Previous conservative offline RL algorithms struggle to generalize to unseen actions, despite their success in learning good in-distribution policy. In contrast, we propose to use the gradient fields of the dataset density generated from a pre-trained offline RL algorithm to adjust the original actions. We decouple the conservatism constraints from the policy, thus can benefit wide offline RL algorithms. As a consequence, we propose the Conservative Denoising Score-based Algorithm (CDSA) which utilizes the denoising score-based model to model the gradient of the dataset density, rather than the dataset density itself, and facilitates a more accurate and efficient method to adjust the action generated by the pre-trained policy in a deterministic and continuous MDP environment. In experiments, we show that our approach significantly improves the performance of baseline algorithms in D4RL datasets, and demonstrate the generalizability and plug-and-play capability of our model across different pre-trained offline RL policy in different tasks. We also validate that the agent exhibits greater risk aversion after employing our method while showcasing its ability to generalize effectively across diverse tasks.

  • 3 authors
·
Jun 11, 2024

Counterfactual Conservative Q Learning for Offline Multi-agent Reinforcement Learning

Offline multi-agent reinforcement learning is challenging due to the coupling effect of both distribution shift issue common in offline setting and the high dimension issue common in multi-agent setting, making the action out-of-distribution (OOD) and value overestimation phenomenon excessively severe. Tomitigate this problem, we propose a novel multi-agent offline RL algorithm, named CounterFactual Conservative Q-Learning (CFCQL) to conduct conservative value estimation. Rather than regarding all the agents as a high dimensional single one and directly applying single agent methods to it, CFCQL calculates conservative regularization for each agent separately in a counterfactual way and then linearly combines them to realize an overall conservative value estimation. We prove that it still enjoys the underestimation property and the performance guarantee as those single agent conservative methods do, but the induced regularization and safe policy improvement bound are independent of the agent number, which is therefore theoretically superior to the direct treatment referred to above, especially when the agent number is large. We further conduct experiments on four environments including both discrete and continuous action settings on both existing and our man-made datasets, demonstrating that CFCQL outperforms existing methods on most datasets and even with a remarkable margin on some of them.

  • 5 authors
·
Sep 22, 2023

Near-optimal Conservative Exploration in Reinforcement Learning under Episode-wise Constraints

This paper investigates conservative exploration in reinforcement learning where the performance of the learning agent is guaranteed to be above a certain threshold throughout the learning process. It focuses on the tabular episodic Markov Decision Process (MDP) setting that has finite states and actions. With the knowledge of an existing safe baseline policy, an algorithm termed as StepMix is proposed to balance the exploitation and exploration while ensuring that the conservative constraint is never violated in each episode with high probability. StepMix features a unique design of a mixture policy that adaptively and smoothly interpolates between the baseline policy and the optimistic policy. Theoretical analysis shows that StepMix achieves near-optimal regret order as in the constraint-free setting, indicating that obeying the stringent episode-wise conservative constraint does not compromise the learning performance. Besides, a randomization-based EpsMix algorithm is also proposed and shown to achieve the same performance as StepMix. The algorithm design and theoretical analysis are further extended to the setting where the baseline policy is not given a priori but must be learned from an offline dataset, and it is proved that similar conservative guarantee and regret can be achieved if the offline dataset is sufficiently large. Experiment results corroborate the theoretical analysis and demonstrate the effectiveness of the proposed conservative exploration strategies.

  • 4 authors
·
Jun 9, 2023

COPlanner: Plan to Roll Out Conservatively but to Explore Optimistically for Model-Based RL

Dyna-style model-based reinforcement learning contains two phases: model rollouts to generate sample for policy learning and real environment exploration using current policy for dynamics model learning. However, due to the complex real-world environment, it is inevitable to learn an imperfect dynamics model with model prediction error, which can further mislead policy learning and result in sub-optimal solutions. In this paper, we propose COPlanner, a planning-driven framework for model-based methods to address the inaccurately learned dynamics model problem with conservative model rollouts and optimistic environment exploration. COPlanner leverages an uncertainty-aware policy-guided model predictive control (UP-MPC) component to plan for multi-step uncertainty estimation. This estimated uncertainty then serves as a penalty during model rollouts and as a bonus during real environment exploration respectively, to choose actions. Consequently, COPlanner can avoid model uncertain regions through conservative model rollouts, thereby alleviating the influence of model error. Simultaneously, it explores high-reward model uncertain regions to reduce model error actively through optimistic real environment exploration. COPlanner is a plug-and-play framework that can be applied to any dyna-style model-based methods. Experimental results on a series of proprioceptive and visual continuous control tasks demonstrate that both sample efficiency and asymptotic performance of strong model-based methods are significantly improved combined with COPlanner.

  • 7 authors
·
Oct 11, 2023

DeliveryBench: Can Agents Earn Profit in Real World?

LLMs and VLMs are increasingly deployed as embodied agents, yet existing benchmarks largely revolve around simple short-term tasks and struggle to capture rich realistic constraints that shape real-world decision making. To close this gap, we propose DeliveryBench, a city-scale embodied benchmark grounded in the real-world profession of food delivery. Food couriers naturally operate under long-horizon objectives (maximizing net profit over hours) while managing diverse constraints, e.g., delivery deadline, transportation expense, vehicle battery, and necessary interactions with other couriers and customers. DeliveryBench instantiates this setting in procedurally generated 3D cities with diverse road networks, buildings, functional locations, transportation modes, and realistic resource dynamics, enabling systematic evaluation of constraint-aware, long-horizon planning. We benchmark a range of VLM-based agents across nine cities and compare them with human players. Our results reveal a substantial performance gap to humans, and find that these agents are short-sighted and frequently break basic commonsense constraints. Additionally, we observe distinct personalities across models (e.g., adventurous GPT-5 vs. conservative Claude), highlighting both the brittleness and the diversity of current VLM-based embodied agents in realistic, constraint-dense environments. Our code, data, and benchmark are available at https://deliverybench.github.io.

  • 6 authors
·
Dec 22, 2025

Accelerating Diffusion LLM Inference via Local Determinism Propagation

Diffusion large language models (dLLMs) represent a significant advancement in text generation, offering parallel token decoding capabilities. However, existing open-source implementations suffer from quality-speed trade-offs that impede their practical deployment. Conservative sampling strategies typically decode only the most confident token per step to ensure quality (i.e., greedy decoding), at the cost of inference efficiency due to repeated redundant refinement iterations--a phenomenon we term delayed decoding. Through systematic analysis of dLLM decoding dynamics, we characterize this delayed decoding behavior and propose a training-free adaptive parallel decoding strategy, named LocalLeap, to address these inefficiencies. LocalLeap is built on two fundamental empirical principles: local determinism propagation centered on high-confidence anchors and progressive spatial consistency decay. By applying these principles, LocalLeap identifies anchors and performs localized relaxed parallel decoding within bounded neighborhoods, achieving substantial inference step reduction through early commitment of already-determined tokens without compromising output quality. Comprehensive evaluation on various benchmarks demonstrates that LocalLeap achieves 6.94times throughput improvements and reduces decoding steps to just 14.2\% of the original requirement, achieving these gains with negligible performance impact. The source codes are available at: https://github.com/friedrichor/LocalLeap.

  • 7 authors
·
Oct 8, 2025

Algorithm-assisted discovery of an intrinsic order among mathematical constants

In recent decades, a growing number of discoveries in fields of mathematics have been assisted by computer algorithms, primarily for exploring large parameter spaces that humans would take too long to investigate. As computers and algorithms become more powerful, an intriguing possibility arises - the interplay between human intuition and computer algorithms can lead to discoveries of novel mathematical concepts that would otherwise remain elusive. To realize this perspective, we have developed a massively parallel computer algorithm that discovers an unprecedented number of continued fraction formulas for fundamental mathematical constants. The sheer number of formulas discovered by the algorithm unveils a novel mathematical structure that we call the conservative matrix field. Such matrix fields (1) unify thousands of existing formulas, (2) generate infinitely many new formulas, and most importantly, (3) lead to unexpected relations between different mathematical constants, including multiple integer values of the Riemann zeta function. Conservative matrix fields also enable new mathematical proofs of irrationality. In particular, we can use them to generalize the celebrated proof by Ap\'ery for the irrationality of zeta(3). Utilizing thousands of personal computers worldwide, our computer-supported research strategy demonstrates the power of experimental mathematics, highlighting the prospects of large-scale computational approaches to tackle longstanding open problems and discover unexpected connections across diverse fields of science.

  • 9 authors
·
Aug 22, 2023

Jurisdiction as Structural Barrier: How Privacy Policy Organization May Reduce Visibility of Substantive Disclosures

Privacy policies are supposed to provide notice. But what if substantive information appears only where users skip it? We identify a structural pattern we call jurisdiction-siloed disclosure: information about data practices appearing in specific, actionable form only within regional compliance sections labeled "California Residents" or "EU/UK Users," while general sections use vague or qualified language for the same practices. Our audit of 123 major companies identifies 282 potential instances across 77 companies (62.6% of this purposive sample). A conservative estimate restricted to practice categories validated against OPP-115 human annotations finds 138 instances across 54 companies (44%); post-2018 categories central to our findings await independent validation. If users skip jurisdiction-labeled sections as information foraging theory predicts, users outside regulated jurisdictions would receive less specific information about practices affecting them--a transparency failure operating through document architecture rather than omission. We propose universal substantive disclosure: practices affecting all users should appear in the main policy body, with regional sections containing only procedural rights information. This standard finds support in analogous disclosure regimes (securities, truth-in-lending, nutritional labeling) where material information must reach all affected parties. Regulators could operationalize this through the FTC's "clear and conspicuous" standard and GDPR transparency principles. This work is hypothesis-generating: we establish that the structural pattern exists and ground the transparency concern in behavioral theory, but direct measurement of jurisdiction-specific section skipping remains the critical validation priority. We release our methodology and annotated dataset to enable replication.

  • 1 authors
·
Jan 28

SEED-GRPO: Semantic Entropy Enhanced GRPO for Uncertainty-Aware Policy Optimization

Large language models (LLMs) exhibit varying levels of confidence across input prompts (questions): some lead to consistent, semantically similar answers, while others yield diverse or contradictory outputs. This variation reflects LLM's uncertainty about the input prompt, a signal of how confidently the model understands a given problem. However, vanilla Group Relative Policy Optimization (GRPO) treats all prompts equally during policy updates, ignoring this important information about the model's knowledge boundaries. To address this limitation, we propose SEED-GRPO (Semantic Entropy EnhanceD GRPO), which explicitly measures LLMs' uncertainty of the input prompts semantic entropy. Semantic entropy measures the diversity of meaning in multiple generated answers given a prompt and uses this to modulate the magnitude of policy updates. This uncertainty-aware training mechanism enables dynamic adjustment of policy update magnitudes based on question uncertainty. It allows more conservative updates on high-uncertainty questions while maintaining the original learning signal on confident ones. Experimental results on five mathematical reasoning benchmarks (AIME24 56.7, AMC 68.7, MATH 83.4, Minerva 34.2, and OlympiadBench 48.0) demonstrate that SEED-GRPO achieves new state-of-the-art performance in average accuracy, validating the effectiveness of uncertainty-aware policy optimization.

  • 4 authors
·
May 18, 2025 16

Fast-Decoding Diffusion Language Models via Progress-Aware Confidence Schedules

Diffusion large language models (dLLMs) offer a promising alternative to autoregressive models, but their practical utility is severely hampered by slow, iterative sampling. We present SchED, a training-free, model-agnostic early-exit algorithm that aggregates full-span logit margins and halts decoding once a smooth, progress-dependent confidence threshold is met. We evaluated SchED on two dLLM families (Dream and LLaDA), in base and instruction-tuned variants across ten benchmarks spanning downstream tasks including multiple-choice question answering (MCQ), math, long-form QA/summarization, and translation. SchED delivers large, stable accelerations: on instruction-tuned models, it achieves 3.8-4.0times speedups while retaining 99.8-100% of the baseline score on average. On base models, SchED yields consistent speedup gains with 99.1-100% performance retention, with up to 2.34times under more aggressive settings. Using a conservative speed metric that heavily penalizes quality loss (QPS, γ{=}4), we show that SchED is robust and clearly outperforms prior confidence-based early-exit methods, which break down on long-form generation. An entropy analysis of the model's token predictions reveals that instruction tuning speeds up the decay of predictive entropy. By turning genuine confidence stabilization into computational savings, SchED makes dLLM decoding substantially more efficient.

  • 4 authors
·
Dec 2, 2025 2

Docs2KG: Unified Knowledge Graph Construction from Heterogeneous Documents Assisted by Large Language Models

Even for a conservative estimate, 80% of enterprise data reside in unstructured files, stored in data lakes that accommodate heterogeneous formats. Classical search engines can no longer meet information seeking needs, especially when the task is to browse and explore for insight formulation. In other words, there are no obvious search keywords to use. Knowledge graphs, due to their natural visual appeals that reduce the human cognitive load, become the winning candidate for heterogeneous data integration and knowledge representation. In this paper, we introduce Docs2KG, a novel framework designed to extract multimodal information from diverse and heterogeneous unstructured documents, including emails, web pages, PDF files, and Excel files. Dynamically generates a unified knowledge graph that represents the extracted key information, Docs2KG enables efficient querying and exploration of document data lakes. Unlike existing approaches that focus on domain-specific data sources or pre-designed schemas, Docs2KG offers a flexible and extensible solution that can adapt to various document structures and content types. The proposed framework unifies data processing supporting a multitude of downstream tasks with improved domain interpretability. Docs2KG is publicly accessible at https://docs2kg.ai4wa.com, and a demonstration video is available at https://docs2kg.ai4wa.com/Video.

  • 8 authors
·
Jun 5, 2024

Reward Model Ensembles Help Mitigate Overoptimization

Reinforcement learning from human feedback (RLHF) is a standard approach for fine-tuning large language models to follow instructions. As part of this process, learned reward models are used to approximately model human preferences. However, as imperfect representations of the "true" reward, these learned reward models are susceptible to overoptimization. Gao et al. (2023) studied this phenomenon in a synthetic human feedback setup with a significantly larger "gold" reward model acting as the true reward (instead of humans) and showed that overoptimization remains a persistent problem regardless of the size of the proxy reward model and training data used. Using a similar setup, we conduct a systematic study to evaluate the efficacy of using ensemble-based conservative optimization objectives, specifically worst-case optimization (WCO) and uncertainty-weighted optimization (UWO), for mitigating reward model overoptimization when using two optimization methods: (a) best-of-n sampling (BoN) (b) proximal policy optimization (PPO). We additionally extend the setup of Gao et al. (2023) to include 25% label noise to better mirror real-world conditions. Both with and without label noise, we find that conservative optimization practically eliminates overoptimization and improves performance by up to 70% for BoN sampling. For PPO, ensemble-based conservative optimization always reduces overoptimization and outperforms single reward model optimization. Moreover, combining it with a small KL penalty successfully prevents overoptimization at no performance cost. Overall, our results demonstrate that ensemble-based conservative optimization can effectively counter overoptimization.

  • 4 authors
·
Oct 4, 2023

Offline Guarded Safe Reinforcement Learning for Medical Treatment Optimization Strategies

When applying offline reinforcement learning (RL) in healthcare scenarios, the out-of-distribution (OOD) issues pose significant risks, as inappropriate generalization beyond clinical expertise can result in potentially harmful recommendations. While existing methods like conservative Q-learning (CQL) attempt to address the OOD issue, their effectiveness is limited by only constraining action selection by suppressing uncertain actions. This action-only regularization imitates clinician actions that prioritize short-term rewards, but it fails to regulate downstream state trajectories, thereby limiting the discovery of improved long-term treatment strategies. To safely improve policy beyond clinician recommendations while ensuring that state-action trajectories remain in-distribution, we propose Offline Guarded Safe Reinforcement Learning (OGSRL), a theoretically grounded model-based offline RL framework. OGSRL introduces a novel dual constraint mechanism for improving policy with reliability and safety. First, the OOD guardian is established to specify clinically validated regions for safe policy exploration. By constraining optimization within these regions, it enables the reliable exploration of treatment strategies that outperform clinician behavior by leveraging the full patient state history, without drifting into unsupported state-action trajectories. Second, we introduce a safety cost constraint that encodes medical knowledge about physiological safety boundaries, providing domain-specific safeguards even in areas where training data might contain potentially unsafe interventions. Notably, we provide theoretical guarantees on safety and near-optimality: policies that satisfy these constraints remain in safe and reliable regions and achieve performance close to the best possible policy supported by the data.

  • 6 authors
·
May 22, 2025

NeoRL-2: Near Real-World Benchmarks for Offline Reinforcement Learning with Extended Realistic Scenarios

Offline reinforcement learning (RL) aims to learn from historical data without requiring (costly) access to the environment. To facilitate offline RL research, we previously introduced NeoRL, which highlighted that datasets from real-world tasks are often conservative and limited. With years of experience applying offline RL to various domains, we have identified additional real-world challenges. These include extremely conservative data distributions produced by deployed control systems, delayed action effects caused by high-latency transitions, external factors arising from the uncontrollable variance of transitions, and global safety constraints that are difficult to evaluate during the decision-making process. These challenges are underrepresented in previous benchmarks but frequently occur in real-world tasks. To address this, we constructed the extended Near Real-World Offline RL Benchmark (NeoRL-2), which consists of 7 datasets from 7 simulated tasks along with their corresponding evaluation simulators. Benchmarking results from state-of-the-art offline RL approaches demonstrate that current methods often struggle to outperform the data-collection behavior policy, highlighting the need for more effective methods. We hope NeoRL-2 will accelerate the development of reinforcement learning algorithms for real-world applications. The benchmark project page is available at https://github.com/polixir/NeoRL2.

  • 6 authors
·
Mar 24, 2025

Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation

Distilling video generation models to extremely low inference budgets (e.g., 2--4 NFEs) is crucial for real-time deployment, yet remains challenging. Trajectory-style consistency distillation often becomes conservative under complex video dynamics, yielding an over-smoothed appearance and weak motion. Distribution matching distillation (DMD) can recover sharp, mode-seeking samples, but its local training signals do not explicitly regularize how denoising updates compose across timesteps, making composed rollouts prone to drift. To overcome this challenge, we propose Self-Consistent Distribution Matching Distillation (SC-DMD), which explicitly regularizes the endpoint-consistent composition of consecutive denoising updates. For real-time autoregressive video generation, we further treat the KV cache as a quality parameterized condition and propose Cache-Distribution-Aware training. This training scheme applies SC-DMD over multi-step rollouts and introduces a cache-conditioned feature alignment objective that steers low-quality outputs toward high-quality references. Across extensive experiments on both non-autoregressive backbones (e.g., Wan~2.1) and autoregressive real-time paradigms (e.g., Self Forcing), our method, dubbed Salt, consistently improves low-NFE video generation quality while remaining compatible with diverse KV-cache memory mechanisms. Source code will be released at https://github.com/XingtongGe/Salt{https://github.com/XingtongGe/Salt}.

  • 9 authors
·
Apr 2 2

ADAS-TO: A Large-Scale Multimodal Naturalistic Dataset and Empirical Characterization of Human Takeovers during ADAS Engagement

Takeovers remain a key safety vulnerability in production ADAS, yet existing public resources rarely provide takeover-centered, real-world data. We present ADAS-TO, the first large-scale naturalistic dataset dedicated to ADAS-to-manual transitions, containing 15,659 takeover-centered 20s clips from 327 drivers across 22 vehicle brands. Each clip synchronizes front-view video with CAN logs. Takeovers are defined as ADAS ON rightarrow OFF transitions, with the primary trigger labeled as brake, steer, gas, mixed, or system disengagement. We further separate planned driver-initiated terminations (Ego) from forced takeovers (Non-ego) using a rule-based partition. While most events occur within conservative kinematic margins, we identify a long tail of 285 safety-critical cases. For these events, we combine kinematic screening with vision--language (VLM) annotation to attribute hazards and relate them to intervention dynamics. The resulting cross-modal analysis shows distinct kinematic signatures across traffic dynamics, infrastructure degradation, and adverse environments, and finds that in 59.3% of critical cases, actionable visual cues emerge at least 3s before takeover, supporting the potential for semantics-aware early warning beyond late-stage kinematic triggers. The dataset is publicly released at huggingface.co/datasets/HenryYHW/ADAS-TO-Sample.

  • 4 authors
·
Mar 6

Are We Really Learning the Score Function? Reinterpreting Diffusion Models Through Wasserstein Gradient Flow Matching

Diffusion models are commonly interpreted as learning the score function, i.e., the gradient of the log-density of noisy data. However, this assumption implies that the target of learning is a conservative vector field, which is not enforced by the neural network architectures used in practice. We present numerical evidence that trained diffusion networks violate both integral and differential constraints required of true score functions, demonstrating that the learned vector fields are not conservative. Despite this, the models perform remarkably well as generative mechanisms. To explain this apparent paradox, we advocate a new theoretical perspective: diffusion training is better understood as flow matching to the velocity field of a Wasserstein Gradient Flow (WGF), rather than as score learning for a reverse-time stochastic differential equation. Under this view, the "probability flow" arises naturally from the WGF framework, eliminating the need to invoke reverse-time SDE theory and clarifying why generative sampling remains successful even when the neural vector field is not a true score. We further show that non-conservative errors from neural approximation do not necessarily harm density transport. Our results advocate for adopting the WGF perspective as a principled, elegant, and theoretically grounded framework for understanding diffusion generative models.

  • 4 authors
·
Aug 29, 2025

Past-Future Scheduler for LLM Serving under SLA Guarantees

The exploration and application of Large Language Models (LLMs) is thriving. To reduce deployment costs, continuous batching has become an essential feature in current service frameworks. The effectiveness of continuous batching relies on an accurate estimate of the memory requirements of requests. However, due to the diversity in request output lengths, existing frameworks tend to adopt aggressive or conservative schedulers, which often result in significant overestimation or underestimation of memory consumption. Consequently, they suffer from harmful request evictions or prolonged queuing times, failing to achieve satisfactory throughput under strict Service Level Agreement (SLA) guarantees (a.k.a. goodput), across various LLM application scenarios with differing input-output length distributions. To address this issue, we propose a novel Past-Future scheduler that precisely estimates the peak memory resources required by the running batch via considering the historical distribution of request output lengths and calculating memory occupancy at each future time point. It adapts to applications with all types of input-output length distributions, balancing the trade-off between request queuing and harmful evictions, thereby consistently achieving better goodput. Furthermore, to validate the effectiveness of the proposed scheduler, we developed a high-performance LLM serving framework, LightLLM, that implements the Past-Future scheduler. Compared to existing aggressive or conservative schedulers, LightLLM demonstrates superior goodput, achieving up to 2-3times higher goodput than other schedulers under heavy loads. LightLLM is open source to boost the research in such direction (https://github.com/ModelTC/lightllm).

  • 8 authors
·
Jul 14, 2025

Policy Regularization with Dataset Constraint for Offline Reinforcement Learning

We consider the problem of learning the best possible policy from a fixed dataset, known as offline Reinforcement Learning (RL). A common taxonomy of existing offline RL works is policy regularization, which typically constrains the learned policy by distribution or support of the behavior policy. However, distribution and support constraints are overly conservative since they both force the policy to choose similar actions as the behavior policy when considering particular states. It will limit the learned policy's performance, especially when the behavior policy is sub-optimal. In this paper, we find that regularizing the policy towards the nearest state-action pair can be more effective and thus propose Policy Regularization with Dataset Constraint (PRDC). When updating the policy in a given state, PRDC searches the entire dataset for the nearest state-action sample and then restricts the policy with the action of this sample. Unlike previous works, PRDC can guide the policy with proper behaviors from the dataset, allowing it to choose actions that do not appear in the dataset along with the given state. It is a softer constraint but still keeps enough conservatism from out-of-distribution actions. Empirical evidence and theoretical analysis show that PRDC can alleviate offline RL's fundamentally challenging value overestimation issue with a bounded performance gap. Moreover, on a set of locomotion and navigation tasks, PRDC achieves state-of-the-art performance compared with existing methods. Code is available at https://github.com/LAMDA-RL/PRDC

  • 5 authors
·
Jun 10, 2023

Leveraging Word Guessing Games to Assess the Intelligence of Large Language Models

The automatic evaluation of LLM-based agent intelligence is critical in developing advanced LLM-based agents. Although considerable effort has been devoted to developing human-annotated evaluation datasets, such as AlpacaEval, existing techniques are costly, time-consuming, and lack adaptability. In this paper, inspired by the popular language game ``Who is Spy'', we propose to use the word guessing game to assess the intelligence performance of LLMs. Given a word, the LLM is asked to describe the word and determine its identity (spy or not) based on its and other players' descriptions. Ideally, an advanced agent should possess the ability to accurately describe a given word using an aggressive description while concurrently maximizing confusion in the conservative description, enhancing its participation in the game. To this end, we first develop DEEP to evaluate LLMs' expression and disguising abilities. DEEP requires LLM to describe a word in aggressive and conservative modes. We then introduce SpyGame, an interactive multi-agent framework designed to assess LLMs' intelligence through participation in a competitive language-based board game. Incorporating multi-agent interaction, SpyGame requires the target LLM to possess linguistic skills and strategic thinking, providing a more comprehensive evaluation of LLMs' human-like cognitive abilities and adaptability in complex communication situations. The proposed evaluation framework is very easy to implement. We collected words from multiple sources, domains, and languages and used the proposed evaluation framework to conduct experiments. Extensive experiments demonstrate that the proposed DEEP and SpyGame effectively evaluate the capabilities of various LLMs, capturing their ability to adapt to novel situations and engage in strategic communication.

  • 10 authors
·
Oct 31, 2023 1

Experimental and Computational Analysis of the Hydrodynamics of Droplet Generation in a Cylindrical Microfluidic Device

This study investigates the hydrodynamics of droplet formation in a T-shaped cylindrical microfluidic device using micro-PIV experiments and CFD simulations. Devices of 150 micro-m internal diameter were fabricated from PDMS via a cost-effective embedded templating method. Flow visualization was conducted using immiscible silicone oil and deionized water, forming water-in-oil droplets. A mathematical model coupling the Navier-Stokes and conservative level-set equations was solved using the finite element method. Detailed flow fields (velocity, pressure, and phase distribution) were obtained over a wide range of flow-rate ratios (0.1-10) and capillary numbers (0.001-0.1) to characterize droplet formation mechanisms. Phase evolution revealed distinct breakup stages (lag, filling, necking, and pinch-off) and multiple regimes (squeezing, dripping, sausage flow, and parallel flow with tip streaming). A regime map delineating droplet and non-droplet regions was developed. Droplet size, curvature, and internal flow profiles exhibited strong dependence on Ca and Qr. Scaling analysis showed linear dependence of droplet size on Qr in the squeezing regime, with curvature nearly independent of Qr. In contrast, both size and curvature followed power-law dependence on Ca and Qr in the dripping regime. Velocity fields inside droplets were laminar and parabolic in the core. Fully developed plug-like profiles appeared in squeezing, whereas front and rear regions remained developing in dripping. Correlations for droplet length, curvature, and film thickness, including a novel thin-film model incorporating visco-inertial and capillary effects, enable predictive design within the studied range. These findings advance fundamental understanding of confined droplet dynamics and provide quantitative guidelines for optimizing droplet-based microfluidic systems.

  • 3 authors
·
Mar 3

Phase Transition for Budgeted Multi-Agent Synergy

Multi-agent systems can improve reliability, yet under a fixed inference budget they often help, saturate, or even collapse. We develop a minimal and calibratable theory that predicts these regimes from three binding constraints of modern agent stacks: finite context windows, lossy inter-agent communication, and shared failures among similar agents. Each leaf agent is summarized by a compute-performance scaling exponent β; communication is captured by a message-length fidelity curve γ(m); dependence is captured by an effective shared-error correlation ρ; and a context window W imposes hard fan-in limits that make hierarchy necessary. For binary success/failure tasks with majority aggregation, we prove a sharp phase transition for deep b-ary trees with correlated inputs and lossy communication: a single scalar α_ρ (combining γ(m), ρ, and fan-in b) determines whether weak signal is amplified to a nontrivial fixed point or washed out to chance. In the amplifying regime, we derive an organization exponent s and show that budgeted synergy, i.e., outperforming the best single agent under the same total budget, occurs exactly when s>β, yielding closed-form compute allocation rules and explicit budget thresholds. We further characterize saturation via a mixing depth and provide a conservative clipped predictor that remains accurate across growth and saturation. A continuous-performance warm-up gives closed-form risks for star, chain, and tree organizations, making correlation- and communication-induced floors explicit and exposing the core design trade-offs in a smooth setting. Finally, we validate the predicted phase boundaries in controlled synthetic simulations and show how the same mechanisms explain the dominant bottlenecks reported in recent large-scale matched-budget studies of LLM agent-system scaling.

  • 3 authors
·
Jan 24

GENIUS: Sketch-based Language Model Pre-training via Extreme and Selective Masking for Text Generation and Augmentation

We introduce GENIUS: a conditional text generation model using sketches as input, which can fill in the missing contexts for a given sketch (key information consisting of textual spans, phrases, or words, concatenated by mask tokens). GENIUS is pre-trained on a large-scale textual corpus with a novel reconstruction from sketch objective using an extreme and selective masking strategy, enabling it to generate diverse and high-quality texts given sketches. Comparison with other competitive conditional language models (CLMs) reveals the superiority of GENIUS's text generation quality. We further show that GENIUS can be used as a strong and ready-to-use data augmentation tool for various natural language processing (NLP) tasks. Most existing textual data augmentation methods are either too conservative, by making small changes to the original text, or too aggressive, by creating entirely new samples. With GENIUS, we propose GeniusAug, which first extracts the target-aware sketches from the original training set and then generates new samples based on the sketches. Empirical experiments on 6 text classification datasets show that GeniusAug significantly improves the models' performance in both in-distribution (ID) and out-of-distribution (OOD) settings. We also demonstrate the effectiveness of GeniusAug on named entity recognition (NER) and machine reading comprehension (MRC) tasks. (Code and models are publicly available at https://github.com/microsoft/SCGLab and https://github.com/beyondguo/genius)

  • 7 authors
·
Nov 18, 2022

ALOE: Action-Level Off-Policy Evaluation for Vision-Language-Action Model Post-Training

We study how to improve large foundation vision-language-action (VLA) systems through online reinforcement learning (RL) in real-world settings. Central to this process is the value function, which provides learning signals to guide VLA learning from experience. In practice, the value function is estimated from trajectory fragments collected from different data sources, including historical policies and intermittent human interventions. Estimating the value function of current behavior quality from the mixture data is inherently an off-policy evaluation problem. However, prior work often adopts conservative on-policy estimation for stability, which avoids direct evaluation of the current high-capacity policy and limits learning effectiveness. In this paper, we propose ALOE, an action-level off-policy evaluation framework for VLA post-training. ALOE applies chunking-based temporal-difference bootstrapping to evaluate individual action sequences instead of predicting final task outcomes. This design improves effective credit assignment to critical action chunks under sparse rewards and supports stable policy improvement. We evaluate our method on three real-world manipulation tasks, including smartphone packing as a high-precision task, laundry folding as a long-horizon deformable-object task, and bimanual pick-and-place involving multi-object perception. Across all tasks, ALOE improves learning efficiency without compromising execution speed, showing that off-policy RL can be reintroduced in a reliable manner for real-world VLA post-training. Videos and additional materials are available at our project website.

  • 13 authors
·
Feb 13

Forbidden Science: Dual-Use AI Challenge Benchmark and Scientific Refusal Tests

The development of robust safety benchmarks for large language models requires open, reproducible datasets that can measure both appropriate refusal of harmful content and potential over-restriction of legitimate scientific discourse. We present an open-source dataset and testing framework for evaluating LLM safety mechanisms across mainly controlled substance queries, analyzing four major models' responses to systematically varied prompts. Our results reveal distinct safety profiles: Claude-3.5-sonnet demonstrated the most conservative approach with 73% refusals and 27% allowances, while Mistral attempted to answer 100% of queries. GPT-3.5-turbo showed moderate restriction with 10% refusals and 90% allowances, and Grok-2 registered 20% refusals and 80% allowances. Testing prompt variation strategies revealed decreasing response consistency, from 85% with single prompts to 65% with five variations. This publicly available benchmark enables systematic evaluation of the critical balance between necessary safety restrictions and potential over-censorship of legitimate scientific inquiry, while providing a foundation for measuring progress in AI safety implementation. Chain-of-thought analysis reveals potential vulnerabilities in safety mechanisms, highlighting the complexity of implementing robust safeguards without unduly restricting desirable and valid scientific discourse.

  • 2 authors
·
Feb 7, 2025 2

New Adaptive Numerical Methods Based on Dual Formulation of Hyperbolic Conservation Laws

In this paper, we propose an adaptive high-order method for hyperbolic systems of conservation laws. The proposed method is based on a dual formulation approach: Two numerical solutions, corresponding to conservative and nonconservative formulations of the same system, are evolved simultaneously. Since nonconservative schemes are known to produce nonphysical weak solutions near discontinuities, we exploit the difference between these two solutions to construct a smoothness indicator (SI). In smooth regions, the difference between the conservative and nonconservative solutions is of the same order as the truncation error of the underlying discretization, whereas in nonsmooth regions, it is {cal O}(1). We apply this idea to the Euler equations of gas dynamics and define the SI using differences in the momentum and pressure variables. This choice allows us to further distinguish neighborhoods of contact discontinuities from other nonsmooth parts of the computed solution. The resulting classification is used to adaptively select numerical discretizations. In the vicinities of contact discontinuities, we employ the low-dissipation central-upwind numerical flux and a second-order piecewise linear reconstruction with the slopes computed using an overcompressive SBM limiter. Elsewhere, we use an alternative weighted essentially non-oscillatory (A-WENO) framework with the central-upwind finite-volume numerical fluxes and either unlimited (in smooth regions) or Ai-WENO-Z (in the nonsmooth regions away from contact discontinuities) fifth-order interpolation. Numerical results for the one- and two-dimensional compressible Euler equations show that the proposed adaptive method improves both the computational efficiency and resolution of complex flow features compared with the non-adaptive fifth-order A-WENO scheme.

  • 4 authors
·
Jan 27

Stabilizing Policy Gradients for Sample-Efficient Reinforcement Learning in LLM Reasoning

Reinforcement Learning, particularly through policy gradient methods, has played a central role in enabling reasoning capabilities of Large Language Models. However, the optimization stability of policy gradients in this setting remains understudied. As a result, existing implementations often resort to conservative hyperparameter choices to ensure stability, which requires more training samples and increases computational costs. Hence, developing models for reliably tracking the underlying optimization dynamics and leveraging them into training enables more sample-efficient regimes and further unleashes scalable post-training. We address this gap by formalizing the stochastic optimization problem of policy gradients with explicit consideration of second-order geometry. We propose a tractable computational framework that tracks and leverages curvature information during policy updates. We further employ this framework to design interventions in the optimization process through data selection. The resultant algorithm, Curvature-Aware Policy Optimization (CAPO), identifies samples that contribute to unstable updates and masks them out. Theoretically, we establish monotonic improvement guarantees under realistic assumptions. On standard math reasoning benchmarks, we empirically show that CAPO ensures stable updates under aggressive learning regimes where baselines catastrophically fail. With minimal intervention (rejecting fewer than 8% of tokens), CAPO achieves up to 30x improvement in sample efficiency over standard GRPO for LLM reasoning.

  • 3 authors
·
Oct 1, 2025

Beyond the Surface: Probing the Ideological Depth of Large Language Models

Large Language Models (LLMs) have demonstrated pronounced ideological leanings, yet the stability and depth of these positions remain poorly understood. Surface-level responses can often be manipulated through simple prompt engineering, calling into question whether they reflect a coherent underlying ideology. This paper investigates the concept of "ideological depth" in LLMs, defined as the robustness and complexity of their internal political representations. We employ a dual approach: first, we measure the "steerability" of two well-known open-source LLMs using instruction prompting and activation steering. We find that while some models can easily switch between liberal and conservative viewpoints, others exhibit resistance or an increased rate of refusal, suggesting a more entrenched ideological structure. Second, we probe the internal mechanisms of these models using Sparse Autoencoders (SAEs). Preliminary analysis reveals that models with lower steerability possess more distinct and abstract ideological features. Our evaluations reveal that one model can contain 7.3x more political features than another model of similar size. This allows targeted ablation of a core political feature in an ideologically "deep" model, leading to consistent, logical shifts in its reasoning across related topics, whereas the same intervention in a "shallow" model results in an increase in refusal outputs. Our findings suggest that ideological depth is a quantifiable property of LLMs and that steerability serves as a valuable window into their latent political architecture.

  • 3 authors
·
Aug 29, 2025

Entropy Adaptive Decoding: Dynamic Model Switching for Efficient Inference

We present Entropy Adaptive Decoding (EAD), a novel approach for efficient language model inference that dynamically switches between different-sized models based on prediction uncertainty. By monitoring rolling entropy in model logit distributions, our method identifies text regions where a smaller model suffices and switches to a larger model only when prediction uncertainty exceeds a threshold. Unlike speculative decoding approaches that maintain perfect output fidelity through verification, EAD accepts controlled output divergence in exchange for computational efficiency. Our experiments on the MATH benchmark demonstrate remarkable efficiency gains across different model families. Using the LLaMA family, we maintain 96.7\% of the 11B model's performance (50.4\% vs 52.1\%) while using it for only 43\% of tokens, decreasing computational cost by 41.5\%. These gains become more pronounced with larger size differentials in the Qwen family, where we achieve 92.9\% of the 14B model's performance (74.3\% vs 80.0\%) while using it for just 25\% of tokens, decreasing computational cost by 67\%. The consistency of these results across model pairs suggests that language model computation can be significantly optimized by selectively deploying model capacity based on local generation complexity. Our findings indicate that current approaches to model inference may be unnecessarily conservative in their pursuit of perfect output fidelity, and that accepting minor performance trade-offs can enable dramatic reductions in computational costs.

  • 1 authors
·
Feb 5, 2025

Automatic Curriculum Expert Iteration for Reliable LLM Reasoning

Hallucinations (i.e., generating plausible but inaccurate content) and laziness (i.e. excessive refusals or defaulting to "I don't know") persist as major challenges in LLM reasoning. Current efforts to reduce hallucinations primarily focus on factual errors in knowledge-grounded tasks, often neglecting hallucinations related to faulty reasoning. Meanwhile, some approaches render LLMs overly conservative, limiting their problem-solving capabilities. To mitigate hallucination and laziness in reasoning tasks, we propose Automatic Curriculum Expert Iteration (Auto-CEI) to enhance LLM reasoning and align responses to the model's capabilities--assertively answering within its limits and declining when tasks exceed them. In our method, Expert Iteration explores the reasoning trajectories near the LLM policy, guiding incorrect paths back on track to reduce compounding errors and improve robustness; it also promotes appropriate "I don't know" responses after sufficient reasoning attempts. The curriculum automatically adjusts rewards, incentivizing extended reasoning before acknowledging incapability, thereby pushing the limits of LLM reasoning and aligning its behaviour with these limits. We compare Auto-CEI with various SOTA baselines across logical reasoning, mathematics, and planning tasks, where Auto-CEI achieves superior alignment by effectively balancing assertiveness and conservativeness.

  • 5 authors
·
Oct 10, 2024

The Relational Machine Calculus

This paper presents the Relational Machine Calculus (RMC): a simple, foundational model of first-order relational programming. The RMC originates from the Functional Machine Calculus (FMC), which generalizes the lambda-calculus and its standard call-by-name stack machine in two directions. One, "locations", introduces multiple stacks, which enable effect operators to be encoded into the abstraction and application constructs. The second, "sequencing", introduces the imperative notions of "skip" and "sequence", similar to kappa-calculus and concatenative programming languages. The key observation of the RMC is that the first-order fragment of the FMC exhibits a latent duality which, given a simple decomposition of the relevant constructors, can be concretely expressed as an involution on syntax. Semantically, this gives rise to a sound and complete calculus for string diagrams of Frobenius monoids. We consider unification as the corresponding symmetric generalization of beta-reduction. By further including standard operators of Kleene algebra, the RMC embeds a range of computational models: the kappa-calculus, logic programming, automata, Interaction Nets, and Petri Nets, among others. These embeddings preserve operational semantics, which for the RMC is again given by a generalization of the standard stack machine for the lambda-calculus. The equational theory of the RMC (which supports reasoning about its operational semantics) is conservative over both the first-order lambda-calculus and Kleene algebra, and can be oriented to give a confluent reduction relation.

  • 3 authors
·
May 17, 2024

Bold but Cautious: Unlocking the Potential of Personalized Federated Learning through Cautiously Aggressive Collaboration

Personalized federated learning (PFL) reduces the impact of non-independent and identically distributed (non-IID) data among clients by allowing each client to train a personalized model when collaborating with others. A key question in PFL is to decide which parameters of a client should be localized or shared with others. In current mainstream approaches, all layers that are sensitive to non-IID data (such as classifier layers) are generally personalized. The reasoning behind this approach is understandable, as localizing parameters that are easily influenced by non-IID data can prevent the potential negative effect of collaboration. However, we believe that this approach is too conservative for collaboration. For example, for a certain client, even if its parameters are easily influenced by non-IID data, it can still benefit by sharing these parameters with clients having similar data distribution. This observation emphasizes the importance of considering not only the sensitivity to non-IID data but also the similarity of data distribution when determining which parameters should be localized in PFL. This paper introduces a novel guideline for client collaboration in PFL. Unlike existing approaches that prohibit all collaboration of sensitive parameters, our guideline allows clients to share more parameters with others, leading to improved model performance. Additionally, we propose a new PFL method named FedCAC, which employs a quantitative metric to evaluate each parameter's sensitivity to non-IID data and carefully selects collaborators based on this evaluation. Experimental results demonstrate that FedCAC enables clients to share more parameters with others, resulting in superior performance compared to state-of-the-art methods, particularly in scenarios where clients have diverse distributions.

  • 5 authors
·
Sep 20, 2023

A Workflow for Offline Model-Free Robotic Reinforcement Learning

Offline reinforcement learning (RL) enables learning control policies by utilizing only prior experience, without any online interaction. This can allow robots to acquire generalizable skills from large and diverse datasets, without any costly or unsafe online data collection. Despite recent algorithmic advances in offline RL, applying these methods to real-world problems has proven challenging. Although offline RL methods can learn from prior data, there is no clear and well-understood process for making various design choices, from model architecture to algorithm hyperparameters, without actually evaluating the learned policies online. In this paper, our aim is to develop a practical workflow for using offline RL analogous to the relatively well-understood workflows for supervised learning problems. To this end, we devise a set of metrics and conditions that can be tracked over the course of offline training, and can inform the practitioner about how the algorithm and model architecture should be adjusted to improve final performance. Our workflow is derived from a conceptual understanding of the behavior of conservative offline RL algorithms and cross-validation in supervised learning. We demonstrate the efficacy of this workflow in producing effective policies without any online tuning, both in several simulated robotic learning scenarios and for three tasks on two distinct real robots, focusing on learning manipulation skills with raw image observations with sparse binary rewards. Explanatory video and additional results can be found at sites.google.com/view/offline-rl-workflow

  • 5 authors
·
Sep 22, 2021

FlexGuard: Continuous Risk Scoring for Strictness-Adaptive LLM Content Moderation

Ensuring the safety of LLM-generated content is essential for real-world deployment. Most existing guardrail models formulate moderation as a fixed binary classification task, implicitly assuming a fixed definition of harmfulness. In practice, enforcement strictness - how conservatively harmfulness is defined and enforced - varies across platforms and evolves over time, making binary moderators brittle under shifting requirements. We first introduce FlexBench, a strictness-adaptive LLM moderation benchmark that enables controlled evaluation under multiple strictness regimes. Experiments on FlexBench reveal substantial cross-strictness inconsistency in existing moderators: models that perform well under one regime can degrade substantially under others, limiting their practical usability. To address this, we propose FlexGuard, an LLM-based moderator that outputs a calibrated continuous risk score reflecting risk severity and supports strictness-specific decisions via thresholding. We train FlexGuard via risk-alignment optimization to improve score-severity consistency and provide practical threshold selection strategies to adapt to target strictness at deployment. Experiments on FlexBench and public benchmarks demonstrate that FlexGuard achieves higher moderation accuracy and substantially improved robustness under varying strictness. We release the source code and data to support reproducibility.

  • 4 authors
·
Feb 26

Diminished Diversity-of-Thought in a Standard Large Language Model

We test whether Large Language Models (LLMs) can be used to simulate human participants in social-science studies. To do this, we run replications of 14 studies from the Many Labs 2 replication project with OpenAI's text-davinci-003 model, colloquially known as GPT3.5. Based on our pre-registered analyses, we find that among the eight studies we could analyse, our GPT sample replicated 37.5% of the original results and 37.5% of the Many Labs 2 results. However, we were unable to analyse the remaining six studies due to an unexpected phenomenon we call the "correct answer" effect. Different runs of GPT3.5 answered nuanced questions probing political orientation, economic preference, judgement, and moral philosophy with zero or near-zero variation in responses: with the supposedly "correct answer." In one exploratory follow-up study, we found that a "correct answer" was robust to changing the demographic details that precede the prompt. In another, we found that most but not all "correct answers" were robust to changing the order of answer choices. One of our most striking findings occurred in our replication of the Moral Foundations Theory survey results, where we found GPT3.5 identifying as a political conservative in 99.6% of the cases, and as a liberal in 99.3% of the cases in the reverse-order condition. However, both self-reported 'GPT conservatives' and 'GPT liberals' showed right-leaning moral foundations. Our results cast doubts on the validity of using LLMs as a general replacement for human participants in the social sciences. Our results also raise concerns that a hypothetical AI-led future may be subject to a diminished diversity-of-thought.

  • 3 authors
·
Feb 13, 2023

Extracting alignment data in open models

In this work, we show that it is possible to extract significant amounts of alignment training data from a post-trained model -- useful to steer the model to improve certain capabilities such as long-context reasoning, safety, instruction following, and maths. While the majority of related work on memorisation has focused on measuring success of training data extraction through string matching, we argue that embedding models are better suited for our specific goals. Distances measured through a high quality embedding model can identify semantic similarities between strings that a different metric such as edit distance will struggle to capture. In fact, in our investigation, approximate string matching would have severely undercounted (by a conservative estimate of 10times) the amount of data that can be extracted due to trivial artifacts that deflate the metric. Interestingly, we find that models readily regurgitate training data that was used in post-training phases such as SFT or RL. We show that this data can be then used to train a base model, recovering a meaningful amount of the original performance. We believe our work exposes a possibly overlooked risk towards extracting alignment data. Finally, our work opens up an interesting discussion on the downstream effects of distillation practices: since models seem to be regurgitating aspects of their training set, distillation can therefore be thought of as indirectly training on the model's original dataset.

google Google
·
Oct 21, 2025 5

Statically Contextualizing Large Language Models with Typed Holes

Large language models (LLMs) have reshaped the landscape of program synthesis. However, contemporary LLM-based code completion systems often hallucinate broken code because they lack appropriate context, particularly when working with definitions not in the training data nor near the cursor. This paper demonstrates that tight integration with the type and binding structure of a language, as exposed by its language server, can address this contextualization problem in a token-efficient manner. In short, we contend that AIs need IDEs, too! In particular, we integrate LLM code generation into the Hazel live program sketching environment. The Hazel Language Server identifies the type and typing context of the hole being filled, even in the presence of errors, ensuring that a meaningful program sketch is always available. This allows prompting with codebase-wide contextual information not lexically local to the cursor, nor necessarily in the same file, but that is likely to be semantically local to the developer's goal. Completions synthesized by the LLM are then iteratively refined via further dialog with the language server. To evaluate these techniques, we introduce MVUBench, a dataset of model-view-update (MVU) web applications. These applications serve as challenge problems due to their reliance on application-specific data structures. We find that contextualization with type definitions is particularly impactful. After introducing our ideas in the context of Hazel we duplicate our techniques and port MVUBench to TypeScript in order to validate the applicability of these methods to higher-resource languages. Finally, we outline ChatLSP, a conservative extension to the Language Server Protocol (LSP) that language servers can implement to expose capabilities that AI code completion systems of various designs can use to incorporate static context when generating prompts for an LLM.

  • 4 authors
·
Sep 1, 2024 2

EVOREFUSE: Evolutionary Prompt Optimization for Evaluation and Mitigation of LLM Over-Refusal to Pseudo-Malicious Instructions

Large language models (LLMs) frequently refuse to respond to pseudo-malicious instructions: semantically harmless input queries triggering unnecessary LLM refusals due to conservative safety alignment, significantly impairing user experience. Collecting such instructions is crucial for evaluating and mitigating over-refusals, but existing instruction curation methods, like manual creation or instruction rewriting, either lack scalability or fail to produce sufficiently diverse and effective refusal-inducing prompts. To address these limitations, we introduce EVOREFUSE, a prompt optimization approach that generates diverse pseudo-malicious instructions consistently eliciting confident refusals across LLMs. EVOREFUSE employs an evolutionary algorithm exploring the instruction space in more diverse directions than existing methods via mutation strategies and recombination, and iteratively evolves seed instructions to maximize evidence lower bound on LLM refusal probability. Using EVOREFUSE, we create two novel datasets: EVOREFUSE-TEST, a benchmark of 582 pseudo-malicious instructions that outperforms the next-best benchmark with 140.41% higher average refusal triggering rate across 9 LLMs, 34.86% greater lexical diversity, and 40.03% improved LLM response confidence scores; and EVOREFUSE-ALIGN, which provides 3,000 pseudo-malicious instructions with responses for supervised and preference-based alignment training. LLAMA3.1-8B-INSTRUCT supervisedly fine-tuned on EVOREFUSE-ALIGN achieves up to 14.31% fewer over-refusals than models trained on the second-best alignment dataset, without compromising safety. Our analysis with EVOREFUSE-TEST reveals models trigger over-refusals by overly focusing on sensitive keywords while ignoring broader context.

  • 9 authors
·
May 29, 2025 2

Reinforcement Learning in the Era of LLMs: What is Essential? What is needed? An RL Perspective on RLHF, Prompting, and Beyond

Recent advancements in Large Language Models (LLMs) have garnered wide attention and led to successful products such as ChatGPT and GPT-4. Their proficiency in adhering to instructions and delivering harmless, helpful, and honest (3H) responses can largely be attributed to the technique of Reinforcement Learning from Human Feedback (RLHF). In this paper, we aim to link the research in conventional RL to RL techniques used in LLM research. Demystify this technique by discussing why, when, and how RL excels. Furthermore, we explore potential future avenues that could either benefit from or contribute to RLHF research. Highlighted Takeaways: 1. RLHF is Online Inverse RL with Offline Demonstration Data. 2. RLHF > SFT because Imitation Learning (and Inverse RL) > Behavior Cloning (BC) by alleviating the problem of compounding error. 3. The RM step in RLHF generates a proxy of the expensive human feedback, such an insight can be generalized to other LLM tasks such as prompting evaluation and optimization where feedback is also expensive. 4. The policy learning in RLHF is more challenging than conventional problems studied in IRL due to their high action dimensionality and feedback sparsity. 5. The main superiority of PPO over off-policy value-based methods is its stability gained from (almost) on-policy data and conservative policy updates.

  • 1 authors
·
Oct 9, 2023

Long-Horizon Model-Based Offline Reinforcement Learning Without Conservatism

Popular offline reinforcement learning (RL) methods rely on conservatism, either by penalizing out-of-dataset actions or by restricting rollout horizons. In this work, we question the universality of this principle and instead revisit a complementary one: a Bayesian perspective. Rather than enforcing conservatism, the Bayesian approach tackles epistemic uncertainty in offline data by modeling a posterior distribution over plausible world models and training a history-dependent agent to maximize expected rewards, enabling test-time generalization. We first illustrate, in a bandit setting, that Bayesianism excels on low-quality datasets where conservatism fails. We then scale this principle to realistic tasks and show that long-horizon planning is critical for reducing value overestimation once conservatism is removed. To make this feasible, we introduce key design choices for performing and learning from long-horizon rollouts while controlling compounding errors. These yield our algorithm, NEUBAY, grounded in the neutral Bayesian principle. On D4RL and NeoRL benchmarks, NEUBAY generally matches or surpasses leading conservative algorithms, achieving new state-of-the-art on 7 datasets. Notably, it succeeds with rollout horizons of several hundred steps, contrary to dominant practice. Finally, we characterize datasets by quality and coverage, showing when NEUBAY is preferable to conservative methods. Together, we argue NEUBAY lays the foundation for a new practical direction in offline and model-based RL.

  • 6 authors
·
Dec 3, 2025

Can MLLMs Read the Room? A Multimodal Benchmark for Verifying Truthfulness in Multi-Party Social Interactions

As AI systems become increasingly integrated into human lives, endowing them with robust social intelligence has emerged as a critical frontier. A key aspect of this intelligence is discerning truth from deception, a ubiquitous element of human interaction that is conveyed through a complex interplay of verbal language and non-verbal visual cues. However, automatic deception detection in dynamic, multi-party conversations remains a significant challenge. The recent rise of powerful Multimodal Large Language Models (MLLMs), with their impressive abilities in visual and textual understanding, makes them natural candidates for this task. Consequently, their capabilities in this crucial domain are mostly unquantified. To address this gap, we introduce a new task, Multimodal Interactive Veracity Assessment (MIVA), and present a novel multimodal dataset derived from the social deduction game Werewolf. This dataset provides synchronized video, text, with verifiable ground-truth labels for every statement. We establish a comprehensive benchmark evaluating state-of-the-art MLLMs, revealing a significant performance gap: even powerful models like GPT-4o struggle to distinguish truth from falsehood reliably. Our analysis of failure modes indicates that these models fail to ground language in visual social cues effectively and may be overly conservative in their alignment, highlighting the urgent need for novel approaches to building more perceptive and trustworthy AI systems.

  • 5 authors
·
Oct 31, 2025

Analyzing Data Quality and Decay in Mega-Constellations: A Physics-Informed Machine Learning Approach

In the era of mega-constellations, the need for accurate and publicly available information has become fundamental for satellite operators to guarantee the safety of spacecrafts and the Low Earth Orbit (LEO) space environment. This study critically evaluates the accuracy and reliability of publicly available ephemeris data for a LEO mega-constellation - Starlink. The goal of this work is twofold: (i) compare and analyze the quality of the data against high-precision numerical propagation. (ii) Leverage Physics-Informed Machine Learning to extract relevant satellite quantities, such as non-conservative forces, during the decay process. By analyzing two months of real orbital data for approximately 1500 Starlink satellites, we identify discrepancies between high precision numerical algorithms and the published ephemerides, recognizing the use of simplified dynamics at fixed thresholds, planned maneuvers, and limitations in uncertainty propagations. Furthermore, we compare data obtained from multiple sources to track and analyze deorbiting satellites over the same period. Empirically, we extract the acceleration profile of satellites during deorbiting and provide insights relating to the effects of non-conservative forces during reentry. For non-deorbiting satellites, the position Root Mean Square Error (RMSE) was approximately 300 m, while for deorbiting satellites it increased to about 600 m. Through this in-depth analysis, we highlight potential limitations in publicly available data for accurate and robust Space Situational Awareness (SSA), and importantly, we propose a data-driven model of satellite decay in mega-constellations.

  • 3 authors
·
Oct 13, 2025

ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding

Video understanding is fundamental to tasks such as action recognition, video reasoning, and robotic control. Early video understanding methods based on large vision-language models (LVLMs) typically adopt a single-pass reasoning paradigm without dynamic feedback, limiting the model's capacity to self-correct and adapt in complex scenarios. Recent efforts have attempted to address this limitation by incorporating reward models and reinforcement learning to enhance reasoning, or by employing tool-agent frameworks. However, these approaches face several challenges, including high annotation costs, reward signals that fail to capture real-time reasoning states, and low inference efficiency. To overcome these issues, we propose ReAgent-V, a novel agentic video understanding framework that integrates efficient frame selection with real-time reward generation during inference. These reward signals not only guide iterative answer refinement through a multi-perspective reflection mechanism-adjusting predictions from conservative, neutral, and aggressive viewpoints-but also enable automatic filtering of high-quality data for supervised fine-tuning (SFT), direct preference optimization (DPO), and group relative policy optimization (GRPO). ReAgent-V is lightweight, modular, and extensible, supporting flexible tool integration tailored to diverse tasks. Extensive experiments on 12 datasets across three core applications-video understanding, video reasoning enhancement, and vision-language-action model alignment-demonstrate significant gains in generalization and reasoning, with improvements of up to 6.9%, 2.1%, and 9.8%, respectively, highlighting the effectiveness and versatility of the proposed framework.

  • 8 authors
·
Jun 2, 2025

Combating Partial Perception Deficit in Autonomous Driving with Multimodal LLM Commonsense

Partial perception deficits can compromise autonomous vehicle safety by disrupting environmental understanding. Current protocols typically respond with immediate stops or minimal-risk maneuvers, worsening traffic flow and lacking flexibility for rare driving scenarios. In this paper, we propose LLM-RCO, a framework leveraging large language models to integrate human-like driving commonsense into autonomous systems facing perception deficits. LLM-RCO features four key modules: hazard inference, short-term motion planner, action condition verifier, and safety constraint generator. These modules interact with the dynamic driving environment, enabling proactive and context-aware control actions to override the original control policy of autonomous agents. To improve safety in such challenging conditions, we construct DriveLM-Deficit, a dataset of 53,895 video clips featuring deficits of safety-critical objects, complete with annotations for LLM-based hazard inference and motion planning fine-tuning. Extensive experiments in adverse driving conditions with the CARLA simulator demonstrate that systems equipped with LLM-RCO significantly improve driving performance, highlighting its potential for enhancing autonomous driving resilience against adverse perception deficits. Our results also show that LLMs fine-tuned with DriveLM-Deficit can enable more proactive movements instead of conservative stops in the context of perception deficits.

  • 7 authors
·
Mar 10, 2025

Exploring HOD-dependent systematics for the DESI 2024 Full-Shape galaxy clustering analysis

We analyse the robustness of the DESI 2024 cosmological inference from fits to the full shape of the galaxy power spectrum to uncertainties in the Halo Occupation Distribution (HOD) model of the galaxy-halo connection and the choice of priors on nuisance parameters. We assess variations in the recovered cosmological parameters across a range of mocks populated with different HOD models and find that shifts are often greater than 20% of the expected statistical uncertainties from the DESI data. We encapsulate the effect of such shifts in terms of a systematic covariance term, C_{rm HOD}, and an additional diagonal contribution quantifying the impact of our choice of nuisance parameter priors on the ability of the effective field theory (EFT) model to correctly recover the cosmological parameters of the simulations. These two covariance contributions are designed to be added to the usual covariance term, C_{rm stat}, describing the statistical uncertainty in the power spectrum measurement, in order to fairly represent these sources of systematic uncertainty. This approach is more general and robust to choices of model free parameters or additional external datasets used in cosmological fits than the alternative approach of adding systematic uncertainties at the level of the recovered marginalised parameter posteriors. We compare the approaches within the context of a fixed LambdaCDM model and demonstrate that our method gives conservative estimates of the systematic uncertainty that nevertheless have little impact on the final posteriors obtained from DESI data.

  • 42 authors
·
Nov 18, 2024

Permissive Information-Flow Analysis for Large Language Models

Large Language Models (LLMs) are rapidly becoming commodity components of larger software systems. This poses natural security and privacy problems: poisoned data retrieved from one component can change the model's behavior and compromise the entire system, including coercing the model to spread confidential data to untrusted components. One promising approach is to tackle this problem at the system level via dynamic information flow (aka taint) tracking. Unfortunately, the traditional approach of propagating the most restrictive input label to the output is too conservative for applications where LLMs operate on inputs retrieved from diverse sources. In this paper, we propose a novel, more permissive approach to propagate information flow labels through LLM queries. The key idea behind our approach is to propagate only the labels of the samples that were influential in generating the model output and to eliminate the labels of unnecessary input. We implement and investigate the effectiveness of two variations of this approach, based on (i) prompt-based retrieval augmentation, and (ii) a k-nearest-neighbors language model. We compare these with the baseline of an introspection-based influence estimator that directly asks the language model to predict the output label. The results obtained highlight the superiority of our prompt-based label propagator, which improves the label in more than 85% of the cases in an LLM agent setting. These findings underscore the practicality of permissive label propagation for retrieval augmentation.

  • 10 authors
·
Oct 3, 2024

On gauge freedom, conservativity and intrinsic dimensionality estimation in diffusion models

Diffusion models are generative models that have recently demonstrated impressive performances in terms of sampling quality and density estimation in high dimensions. They rely on a forward continuous diffusion process and a backward continuous denoising process, which can be described by a time-dependent vector field and is used as a generative model. In the original formulation of the diffusion model, this vector field is assumed to be the score function (i.e. it is the gradient of the log-probability at a given time in the diffusion process). Curiously, on the practical side, most studies on diffusion models implement this vector field as a neural network function and do not constrain it be the gradient of some energy function (that is, most studies do not constrain the vector field to be conservative). Even though some studies investigated empirically whether such a constraint will lead to a performance gain, they lead to contradicting results and failed to provide analytical results. Here, we provide three analytical results regarding the extent of the modeling freedom of this vector field. {Firstly, we propose a novel decomposition of vector fields into a conservative component and an orthogonal component which satisfies a given (gauge) freedom. Secondly, from this orthogonal decomposition, we show that exact density estimation and exact sampling is achieved when the conservative component is exactly equals to the true score and therefore conservativity is neither necessary nor sufficient to obtain exact density estimation and exact sampling. Finally, we show that when it comes to inferring local information of the data manifold, constraining the vector field to be conservative is desirable.

  • 2 authors
·
Feb 6, 2024

Novel Quadratic Constraints for Extending LipSDP beyond Slope-Restricted Activations

Recently, semidefinite programming (SDP) techniques have shown great promise in providing accurate Lipschitz bounds for neural networks. Specifically, the LipSDP approach (Fazlyab et al., 2019) has received much attention and provides the least conservative Lipschitz upper bounds that can be computed with polynomial time guarantees. However, one main restriction of LipSDP is that its formulation requires the activation functions to be slope-restricted on [0,1], preventing its further use for more general activation functions such as GroupSort, MaxMin, and Householder. One can rewrite MaxMin activations for example as residual ReLU networks. However, a direct application of LipSDP to the resultant residual ReLU networks is conservative and even fails in recovering the well-known fact that the MaxMin activation is 1-Lipschitz. Our paper bridges this gap and extends LipSDP beyond slope-restricted activation functions. To this end, we provide novel quadratic constraints for GroupSort, MaxMin, and Householder activations via leveraging their underlying properties such as sum preservation. Our proposed analysis is general and provides a unified approach for estimating ell_2 and ell_infty Lipschitz bounds for a rich class of neural network architectures, including non-residual and residual neural networks and implicit models, with GroupSort, MaxMin, and Householder activations. Finally, we illustrate the utility of our approach with a variety of experiments and show that our proposed SDPs generate less conservative Lipschitz bounds in comparison to existing approaches.

  • 7 authors
·
Jan 25, 2024

Adversarial Training for High-Stakes Reliability

In the future, powerful AI systems may be deployed in high-stakes settings, where a single failure could be catastrophic. One technique for improving AI safety in high-stakes settings is adversarial training, which uses an adversary to generate examples to train on in order to achieve better worst-case performance. In this work, we used a safe language generation task (``avoid injuries'') as a testbed for achieving high reliability through adversarial training. We created a series of adversarial training techniques -- including a tool that assists human adversaries -- to find and eliminate failures in a classifier that filters text completions suggested by a generator. In our task, we determined that we can set very conservative classifier thresholds without significantly impacting the quality of the filtered outputs. We found that adversarial training increased robustness to the adversarial attacks that we trained on -- doubling the time for our contractors to find adversarial examples both with our tool (from 13 to 26 minutes) and without (from 20 to 44 minutes) -- without affecting in-distribution performance. We hope to see further work in the high-stakes reliability setting, including more powerful tools for enhancing human adversaries and better ways to measure high levels of reliability, until we can confidently rule out the possibility of catastrophic deployment-time failures of powerful models.

  • 12 authors
·
May 3, 2022

Parameter-free Online Test-time Adaptation

Training state-of-the-art vision models has become prohibitively expensive for researchers and practitioners. For the sake of accessibility and resource reuse, it is important to focus on adapting these models to a variety of downstream scenarios. An interesting and practical paradigm is online test-time adaptation, according to which training data is inaccessible, no labelled data from the test distribution is available, and adaptation can only happen at test time and on a handful of samples. In this paper, we investigate how test-time adaptation methods fare for a number of pre-trained models on a variety of real-world scenarios, significantly extending the way they have been originally evaluated. We show that they perform well only in narrowly-defined experimental setups and sometimes fail catastrophically when their hyperparameters are not selected for the same scenario in which they are being tested. Motivated by the inherent uncertainty around the conditions that will ultimately be encountered at test time, we propose a particularly "conservative" approach, which addresses the problem with a Laplacian Adjusted Maximum-likelihood Estimation (LAME) objective. By adapting the model's output (not its parameters), and solving our objective with an efficient concave-convex procedure, our approach exhibits a much higher average accuracy across scenarios than existing methods, while being notably faster and have a much lower memory footprint. The code is available at https://github.com/fiveai/LAME.

  • 4 authors
·
Jan 14, 2022