Abstract
LiSA enables adaptive safety guardrails for AI agents by converting occasional failures into reusable policy abstractions and using evidence-aware confidence gating to improve performance under sparse and noisy feedback conditions.
As AI agents move from chat interfaces to systems that read private data, call tools, and execute multi-step workflows, guardrails become a last line of defense against concrete deployment harms. In these settings, guardrail failures are no longer merely answer-quality errors: they can leak secrets, authorize unsafe actions, or block legitimate work. The hardest failures are often contextual: whether an action is acceptable depends on local privacy norms, organizational policies, and user expectations that resist pre-deployment specification. This creates a practical gap: guardrails must adapt to their own operating environments, yet deployment feedback is typically limited to sparse, noisy user-reported failures, and repeated fine-tuning is often impractical. To address this gap, we propose LiSA (Lifelong Safety Adaptation), a conservative policy induction framework that improves a fixed base guardrail through structured memory. LiSA converts occasional failures into reusable policy abstractions so that sparse reports can generalize beyond individual cases, adds conflict-aware local rules to prevent overgeneralization in mixed-label contexts, and applies evidence-aware confidence gating via a posterior lower bound, so that memory reuse scales with accumulated evidence rather than empirical accuracy alone. Across PrivacyLens+, ConFaide+, and AgentHarm, LiSA consistently outperforms strong memory-based baselines under sparse feedback, remains robust under noisy user feedback even at 20% label-flip rates, and pushes the latency–performance frontier beyond backbone model scaling. Ultimately, LiSA offers a practical path to secure AI agents against the unpredictable long tail of real-world edge-case risks.
Community
As LLM agents read private data, call tools, and run multi-step workflows, guardrail failures stop being answer-quality issues — they become real harms: leaked secrets, unsafe actions, blocked legitimate work. And the hardest failures are contextual: local norms, org policies, evolving user expectations that no static guardrail can fully enumerate in advance.
Learning from deployment sounds obvious — but feedback is sparse, often noisy, and overreacting to a handful of cases easily breaks helpfulness or safety.
Three modules in LiSA:
① Broad policy abstraction — turn sparse failures into reusable policies
② Conflict-aware local policies — preserve boundary cues in mixed-label regions where a single broad rule would overgeneralize
③ Evidence-aware confidence gating — Beta posterior lower bound, so "validated once" ≠ "validated 100 times" (see the sketch below)
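A minimal sketch of how the gating in module ③ could be implemented, with modules ① and ② represented only as a hypothetical record type. The PolicyRecord class, the Beta(1,1) prior, the 5% quantile, and the 0.7 reuse threshold are illustrative assumptions rather than details from the paper; the Beta posterior lower bound itself is the mechanism named above (assuming Python with SciPy available).

from dataclasses import dataclass, field
from scipy.stats import beta  # Beta posterior quantiles

@dataclass
class PolicyRecord:
    """Hypothetical memory entry: a broad rule induced from reported
    failures (module ①) plus local exceptions marking mixed-label
    contexts where the broad rule should not fire (module ②)."""
    rule: str                                          # e.g. "redact client names in external emails"
    local_exceptions: list[str] = field(default_factory=list)
    successes: int = 0                                 # times the rule was validated in deployment
    failures: int = 0                                  # times it was contradicted

def confidence_lower_bound(rec: PolicyRecord, prior_a=1.0, prior_b=1.0, q=0.05) -> float:
    """Module ③: lower credible bound on the rule's reliability under a
    Beta(prior_a + successes, prior_b + failures) posterior. The bound
    rises only as evidence accumulates, so 1/1 and 100/100 validations
    score very differently even though both are 100% empirically accurate."""
    return beta.ppf(q, prior_a + rec.successes, prior_b + rec.failures)

def should_apply(rec: PolicyRecord, threshold: float = 0.7) -> bool:
    """Evidence-aware gate: reuse the memorized rule only once its
    posterior lower bound clears the (illustrative) threshold."""
    return confidence_lower_bound(rec) >= threshold

# "Validated once" vs. "validated 100 times":
once = PolicyRecord("redact client names in external emails", successes=1)
often = PolicyRecord("redact client names in external emails", successes=100)
print(confidence_lower_bound(once), should_apply(once))    # ~0.22, False
print(confidence_lower_bound(often), should_apply(often))  # ~0.97, True

Gating on the lower bound rather than the raw success rate is the design choice that keeps a handful of noisy or flipped labels from swinging the whole memory.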
Results on PrivacyLens+, ConFaide+, AgentHarm:
✅ Beats strong memory baselines (ReasoningBank, Synapse, AGrail) under sparse feedback
✅ Stays stable even with 20% label-flip noise — gating is the key stabilizer
✅ Lightweight model + LiSA pushes the latency–performance frontier past larger un-adapted backbones
The takeaway: static guardrails can't anticipate the long tail; unconstrained adaptation overreaches. Conservative policy induction is the practical middle ground.