CONSCIENTIA: Can LLM Agents Learn to Strategize? Emergent Deception and Trust in a Multi-Agent NYC Simulation
Abstract
Large language model agents demonstrate limited strategic behaviors, including selective trust and deception, in a simulated urban environment, while remaining vulnerable to adversarial persuasion despite improved resistance over iterations.
As large language models (LLMs) are increasingly deployed as autonomous agents, understanding how strategic behavior emerges in multi-agent environments has become an important alignment challenge. We take a neutral empirical stance and construct a controlled environment in which strategic behavior can be directly observed and measured. We introduce a large-scale multi-agent simulation in a simplified model of New York City, where LLM-driven agents interact under opposing incentives. Blue agents aim to reach their destinations efficiently, while Red agents attempt to divert them toward billboard-heavy routes using persuasive language to maximize advertising revenue. Hidden identities make navigation socially mediated, forcing agents to decide when to trust or deceive. We study policy learning through an iterative simulation pipeline that updates agent policies across repeated interaction rounds using Kahneman-Tversky Optimization (KTO). Blue agents are optimized to reduce billboard exposure while preserving navigation efficiency, whereas Red agents adapt to exploit remaining weaknesses. Across iterations, the best Blue policy improves task success from 46.0% to 57.3%, although susceptibility remains high at 70.7%. Later policies exhibit stronger selective cooperation while preserving trajectory efficiency. However, a persistent safety-helpfulness trade-off remains: policies that better resist adversarial steering do not simultaneously maximize task completion. Overall, our results show that LLM agents can exhibit limited strategic behavior, including selective trust and deception, while remaining highly vulnerable to adversarial persuasion.
Community
A controlled multi-agent simulation framework showing that LLM agents develop limited, fragile strategic behavior (selective trust, deception) under adversarial interaction, with alignment via KTO improving outcomes but failing to resolve long-horizon vulnerability.
🧪 Multi-Agent Urban Simulation Framework:
Introduces a large-scale NYC-inspired environment with 150 Blue (goal-driven) and 100 Red (adversarial) agents, where hidden identities + dialogue-mediated navigation induce strategic decision-making. Red agents perform persuasive adversarial steering, enabling direct measurement of trust, deception, and long-horizon manipulation in LLM agents.
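The susceptibility measurement described above can be illustrated with a deliberately toy sketch. This is not the authors' implementation: the names `Advice`, `follows_advice`, and `susceptibility` are hypothetical, and the real paper measures steering over full dialogues and routes rather than single advice events. The sketch only shows the shape of the metric: a Blue agent that cannot see the sender's role and trusts anonymous advice at some rate gets steered onto billboard-heavy detours at roughly that rate.

```python
import random
from dataclasses import dataclass


@dataclass
class Advice:
    sender_is_red: bool      # ground-truth role, hidden from the receiver
    suggests_detour: bool    # whether the advice points at a billboard route


def follows_advice(trust_prob: float, rng: random.Random) -> bool:
    # The Blue agent cannot observe sender_is_red, so its only lever
    # is how readily it trusts anonymous advice.
    return rng.random() < trust_prob


def susceptibility(trust_prob: float, n: int = 10_000, seed: int = 0) -> float:
    """Fraction of adversarial advice events that steer the agent (toy)."""
    rng = random.Random(seed)
    steered = 0
    for _ in range(n):
        advice = Advice(sender_is_red=True, suggests_detour=True)
        if follows_advice(trust_prob, rng) and advice.suggests_detour:
            steered += 1
    return steered / n
```

In this toy model a trust rate near 0.7 reproduces the ~70% susceptibility figure by construction; the point is only that hidden identities turn a navigation task into a trust-calibration problem.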
🧩 Closed-Loop Alignment via KTO (Trajectory-Level Optimization):
Proposes a 10-iteration self-play alignment loop using Kahneman–Tversky Optimization (KTO), which optimizes policies from unpaired desirable/undesirable trajectories (no preference pairs or dense rewards). This aligns naturally with long-horizon interaction data and avoids DPO/PPO limitations by directly shaping an implicit reward via log-ratio policy updates.
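The objective referenced here can be sketched per trajectory. This follows the published KTO formulation (implicit reward as a scaled policy/reference log-ratio, asymmetric value for desirable vs. undesirable examples), but simplifies the reference point `z_ref` (a KL estimate in the original) to a constant; the function names are hypothetical, not from this paper's code.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def kto_loss(logp_policy: float, logp_ref: float, desirable: bool,
             beta: float = 0.1, lambda_d: float = 1.0,
             lambda_u: float = 1.0, z_ref: float = 0.0) -> float:
    """Per-trajectory KTO loss (sketch).

    logp_policy / logp_ref: total log-probability of the trajectory under
    the current policy and the frozen reference model.
    z_ref: reference point (a KL estimate in the paper; a constant here).
    """
    log_ratio = logp_policy - logp_ref  # implicit reward signal
    if desirable:
        # value rises (loss falls) as the policy upweights the trajectory
        value = lambda_d * sigmoid(beta * (log_ratio - z_ref))
        return lambda_d - value
    # undesirable: value rises (loss falls) as the policy downweights it
    value = lambda_u * sigmoid(beta * (z_ref - log_ratio))
    return lambda_u - value
```

A training step would average this loss over a batch of labeled trajectories and backpropagate through `logp_policy` only; no preference pairs or per-step rewards are needed, which is exactly the property that makes the objective fit long-horizon interaction data.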
🧠 Emergent Strategic Behavior & Long-Horizon Failure Modes:
Shows agents develop selective cooperation (reduced over-refusal with resistance maintained above 90%) and partial robustness (TSR up from 46% to 57%), but remain highly vulnerable (roughly 70% susceptibility). Crucially identifies delayed compromise, repeated steering, and social trust exploitation as the dominant failure modes, highlighting that alignment must target trajectory-level consistency, not one-step safety.
Get this paper in your agent:

hf papers read 2604.09746

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash