Abstract
LiSA enables adaptive safety guardrails for AI agents by converting occasional failures into reusable policy abstractions and using evidence-aware confidence gating to improve performance under sparse and noisy feedback conditions.
As AI agents move from chat interfaces to systems that read private data, call tools, and execute multi-step workflows, guardrails become a last line of defense against concrete deployment harms. In these settings, guardrail failures are no longer merely answer-quality errors: they can leak secrets, authorize unsafe actions, or block legitimate work. The hardest failures are often contextual: whether an action is acceptable depends on local privacy norms, organizational policies, and user expectations that resist pre-deployment specification. This creates a practical gap: guardrails must adapt to their own operating environments, yet deployment feedback is typically limited to sparse, noisy user-reported failures, and repeated fine-tuning is often impractical. To address this gap, we propose LiSA (Lifelong Safety Adaptation), a conservative policy induction framework that improves a fixed base guardrail through structured memory. LiSA converts occasional failures into reusable policy abstractions so that sparse reports can generalize beyond individual cases, adds conflict-aware local rules to prevent overgeneralization in mixed-label contexts, and applies evidence-aware confidence gating via a posterior lower bound, so that memory reuse scales with accumulated evidence rather than empirical accuracy alone. Across PrivacyLens+, ConFaide+, and AgentHarm, LiSA consistently outperforms strong memory-based baselines under sparse feedback, remains robust under noisy user feedback even at 20% label-flip rates, and pushes the latency–performance frontier beyond backbone model scaling. Ultimately, LiSA offers a practical path to secure AI agents against the unpredictable long tail of real-world edge-case risks.
Community
As LLM agents read private data, call tools, and run multi-step workflows, guardrail failures stop being answer-quality issues — they become real harms: leaked secrets, unsafe actions, blocked legitimate work. And the hardest failures are contextual: local norms, org policies, evolving user expectations that no static guardrail can fully enumerate in advance.
Learning from deployment sounds obvious — but feedback is sparse, often noisy, and overreacting to a handful of cases easily breaks helpfulness or safety.
Three modules in LiSA:
① Broad policy abstraction — turn sparse failures into reusable policies
② Conflict-aware local policies — preserve boundary cues in mixed-label regions where a single broad rule would overgeneralize
③ Evidence-aware confidence gating — Beta posterior lower bound, so "validated once" ≠ "validated 100 times" (see the sketch below)
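A minimal sketch of how the gating in module ③ could be implemented, with modules ① and ② represented only as a hypothetical record type. The PolicyRecord class, the Beta(1,1) prior, the 5% quantile, and the 0.7 reuse threshold are illustrative assumptions rather than details from the paper; the Beta posterior lower bound itself is the mechanism named above (assuming Python with SciPy available).

from dataclasses import dataclass, field
from scipy.stats import beta  # Beta posterior quantiles

@dataclass
class PolicyRecord:
    """Hypothetical memory entry: a broad rule induced from reported
    failures (module ①) plus local exceptions marking mixed-label
    contexts where the broad rule should not fire (module ②)."""
    rule: str                                          # e.g. "redact client names in external emails"
    local_exceptions: list[str] = field(default_factory=list)
    successes: int = 0                                 # times the rule was validated in deployment
    failures: int = 0                                  # times it was contradicted

def confidence_lower_bound(rec: PolicyRecord, prior_a=1.0, prior_b=1.0, q=0.05) -> float:
    """Module ③: lower credible bound on the rule's reliability under a
    Beta(prior_a + successes, prior_b + failures) posterior. The bound
    rises only as evidence accumulates, so 1/1 and 100/100 validations
    score very differently even though both are 100% empirically accurate."""
    return beta.ppf(q, prior_a + rec.successes, prior_b + rec.failures)

def should_apply(rec: PolicyRecord, threshold: float = 0.7) -> bool:
    """Evidence-aware gate: reuse the memorized rule only once its
    posterior lower bound clears the (illustrative) threshold."""
    return confidence_lower_bound(rec) >= threshold

# "Validated once" vs. "validated 100 times":
once = PolicyRecord("redact client names in external emails", successes=1)
often = PolicyRecord("redact client names in external emails", successes=100)
print(confidence_lower_bound(once), should_apply(once))    # ~0.22, False
print(confidence_lower_bound(often), should_apply(often))  # ~0.97, True

Gating on the lower bound rather than the raw success rate is the design choice that keeps a handful of noisy or flipped labels from swinging the whole memory.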
Results on PrivacyLens+, ConFaide+, AgentHarm:
✅ Beats strong memory baselines (ReasoningBank, Synapse, AGrail) under sparse feedback
✅ Stays stable even with 20% label-flip noise — gating is the key stabilizer
✅ Lightweight model + LiSA pushes the latency–performance frontier past larger un-adapted backbones
The takeaway: static guardrails can't anticipate the long tail; unconstrained adaptation overreaches. Conservative policy induction is the practical middle ground.