CONSCIENTIA: Can LLM Agents Learn to Strategize? Emergent Deception and Trust in a Multi-Agent NYC Simulation
Abstract
Large language model agents demonstrate limited strategic behaviors, including selective trust and deception, in a simulated urban environment, while remaining vulnerable to adversarial persuasion despite improved resistance over iterations.
As large language models (LLMs) are increasingly deployed as autonomous agents, understanding how strategic behavior emerges in multi-agent environments has become an important alignment challenge. We take a neutral empirical stance and construct a controlled environment in which strategic behavior can be directly observed and measured. We introduce a large-scale multi-agent simulation in a simplified model of New York City, where LLM-driven agents interact under opposing incentives. Blue agents aim to reach their destinations efficiently, while Red agents attempt to divert them toward billboard-heavy routes using persuasive language to maximize advertising revenue. Hidden identities make navigation socially mediated, forcing agents to decide when to trust or deceive. We study policy learning through an iterative simulation pipeline that updates agent policies across repeated interaction rounds using Kahneman-Tversky Optimization (KTO). Blue agents are optimized to reduce billboard exposure while preserving navigation efficiency, whereas Red agents adapt to exploit remaining weaknesses. Across iterations, the best Blue policy improves task success from 46.0% to 57.3%, although susceptibility remains high at 70.7%. Later policies exhibit stronger selective cooperation while preserving trajectory efficiency. However, a persistent safety-helpfulness trade-off remains: policies that better resist adversarial steering do not simultaneously maximize task completion. Overall, our results show that LLM agents can exhibit limited strategic behavior, including selective trust and deception, while remaining highly vulnerable to adversarial persuasion.
Community
A controlled multi-agent simulation framework showing that LLM agents develop limited, fragile strategic behavior (selective trust, deception) under adversarial interaction, with alignment via KTO improving outcomes but failing to resolve long-horizon vulnerability.
🧪 Multi-Agent Urban Simulation Framework:
Introduces a large-scale NYC-inspired environment with 150 Blue (goal-driven) and 100 Red (adversarial) agents, where hidden identities + dialogue-mediated navigation induce strategic decision-making. Red agents perform persuasive adversarial steering, enabling direct measurement of trust, deception, and long-horizon manipulation in LLM agents.
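The susceptibility measurement described above can be illustrated with a deliberately toy sketch. This is not the authors' implementation: the names `Advice`, `follows_advice`, and `susceptibility` are hypothetical, and the real paper measures steering over full dialogues and routes rather than single advice events. The sketch only shows the shape of the metric: a Blue agent that cannot see the sender's role and trusts anonymous advice at some rate gets steered onto billboard-heavy detours at roughly that rate.

```python
import random
from dataclasses import dataclass


@dataclass
class Advice:
    sender_is_red: bool      # ground-truth role, hidden from the receiver
    suggests_detour: bool    # whether the advice points at a billboard route


def follows_advice(trust_prob: float, rng: random.Random) -> bool:
    # The Blue agent cannot observe sender_is_red, so its only lever
    # is how readily it trusts anonymous advice.
    return rng.random() < trust_prob


def susceptibility(trust_prob: float, n: int = 10_000, seed: int = 0) -> float:
    """Fraction of adversarial advice events that steer the agent (toy)."""
    rng = random.Random(seed)
    steered = 0
    for _ in range(n):
        advice = Advice(sender_is_red=True, suggests_detour=True)
        if follows_advice(trust_prob, rng) and advice.suggests_detour:
            steered += 1
    return steered / n
```

In this toy model a trust rate near 0.7 reproduces the ~70% susceptibility figure by construction; the point is only that hidden identities turn a navigation task into a trust-calibration problem.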
🧩 Closed-Loop Alignment via KTO (Trajectory-Level Optimization):
Proposes a 10-iteration self-play alignment loop using Kahneman–Tversky Optimization (KTO), which optimizes policies from unpaired desirable/undesirable trajectories (no preference pairs or dense rewards). This aligns naturally with long-horizon interaction data and avoids DPO/PPO limitations by directly shaping an implicit reward via log-ratio policy updates.
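The objective referenced here can be sketched per trajectory. This follows the published KTO formulation (implicit reward as a scaled policy/reference log-ratio, asymmetric value for desirable vs. undesirable examples), but simplifies the reference point `z_ref` (a KL estimate in the original) to a constant; the function names are hypothetical, not from this paper's code.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def kto_loss(logp_policy: float, logp_ref: float, desirable: bool,
             beta: float = 0.1, lambda_d: float = 1.0,
             lambda_u: float = 1.0, z_ref: float = 0.0) -> float:
    """Per-trajectory KTO loss (sketch).

    logp_policy / logp_ref: total log-probability of the trajectory under
    the current policy and the frozen reference model.
    z_ref: reference point (a KL estimate in the paper; a constant here).
    """
    log_ratio = logp_policy - logp_ref  # implicit reward signal
    if desirable:
        # value rises (loss falls) as the policy upweights the trajectory
        value = lambda_d * sigmoid(beta * (log_ratio - z_ref))
        return lambda_d - value
    # undesirable: value rises (loss falls) as the policy downweights it
    value = lambda_u * sigmoid(beta * (z_ref - log_ratio))
    return lambda_u - value
```

A training step would average this loss over a batch of labeled trajectories and backpropagate through `logp_policy` only; no preference pairs or per-step rewards are needed, which is exactly the property that makes the objective fit long-horizon interaction data.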
🧠 Emergent Strategic Behavior & Long-Horizon Failure Modes:
Shows agents develop selective cooperation (reduced over-refusal with resistance maintained above 90%) and partial robustness (TSR up from 46% to 57%), but remain highly vulnerable (roughly 70% susceptibility). Crucially identifies delayed compromise, repeated steering, and social trust exploitation as the dominant failure modes, highlighting that alignment must target trajectory-level consistency, not one-step safety.
Get this paper in your agent:

hf papers read 2604.09746

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash