arxiv:2604.09746

CONSCIENTIA: Can LLM Agents Learn to Strategize? Emergent Deception and Trust in a Multi-Agent NYC Simulation

Published on Apr 10 · Submitted by Aman Chadha on Apr 15
Abstract

Large language model agents demonstrate limited strategic behaviors, including selective trust and deception, in a simulated urban environment, while remaining vulnerable to adversarial persuasion despite improved resistance over training iterations.

AI-generated summary

As large language models (LLMs) are increasingly deployed as autonomous agents, understanding how strategic behavior emerges in multi-agent environments has become an important alignment challenge. We take a neutral empirical stance and construct a controlled environment in which strategic behavior can be directly observed and measured. We introduce a large-scale multi-agent simulation in a simplified model of New York City, where LLM-driven agents interact under opposing incentives. Blue agents aim to reach their destinations efficiently, while Red agents attempt to divert them toward billboard-heavy routes using persuasive language to maximize advertising revenue. Hidden identities make navigation socially mediated, forcing agents to decide when to trust or deceive. We study policy learning through an iterative simulation pipeline that updates agent policies across repeated interaction rounds using Kahneman-Tversky Optimization (KTO). Blue agents are optimized to reduce billboard exposure while preserving navigation efficiency, whereas Red agents adapt to exploit remaining weaknesses. Across iterations, the best Blue policy improves task success from 46.0% to 57.3%, although susceptibility remains high at 70.7%. Later policies exhibit stronger selective cooperation while preserving trajectory efficiency. However, a persistent safety-helpfulness trade-off remains: policies that better resist adversarial steering do not simultaneously maximize task completion. Overall, our results show that LLM agents can exhibit limited strategic behavior, including selective trust and deception, while remaining highly vulnerable to adversarial persuasion.

Community

Paper author and submitter

A controlled multi-agent simulation framework showing that LLM agents develop limited, fragile strategic behavior (selective trust, deception) under adversarial interaction, with alignment via KTO improving outcomes but failing to resolve long-horizon vulnerability.

🧪 Adversarial Multi-Agent Urban Simulation:
Introduces a large-scale NYC-inspired environment with 150 Blue (goal-driven) and 100 Red (adversarial) agents, where hidden identities + dialogue-mediated navigation induce strategic decision-making. Red agents perform persuasive adversarial steering, enabling direct measurement of trust, deception, and long-horizon manipulation in LLM agents.
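The paper does not publish code, but the core trust dynamic it measures can be sketched in a few lines. The following is a hypothetical toy model (the function names, the learning rate, and the threshold are assumptions, not the paper's implementation): a Blue agent keeps a scalar trust estimate for its interlocutors, updates it after each piece of route advice, and only follows advice once trust clears a threshold.

```python
def update_trust(trust, advice_helped, lr=0.2):
    """Move trust toward 1 after helpful advice, toward 0 after a detour.

    A simple exponential-moving-average update; the paper's agents make
    this decision via LLM dialogue rather than an explicit scalar.
    """
    target = 1.0 if advice_helped else 0.0
    return trust + lr * (target - trust)


def blue_decision(trust, threshold=0.5):
    """Follow the advised route only if accumulated trust clears a threshold."""
    return trust >= threshold


# Repeated adversarial steering erodes trust until advice is refused.
trust = 0.6
print(blue_decision(trust))          # initially cooperative
for _ in range(2):
    trust = update_trust(trust, advice_helped=False)
print(trust, blue_decision(trust))   # trust has decayed below threshold
```

This captures, in miniature, why "repeated steering" is a distinct failure mode: a single bad interaction barely moves the estimate, but a sequence of them flips the agent's behavior.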

🧩 Closed-Loop Alignment via KTO (Trajectory-Level Optimization):
Proposes a 10-iteration self-play alignment loop using Kahneman–Tversky Optimization (KTO), which optimizes policies from unpaired desirable/undesirable trajectories (no preference pairs or dense rewards). This aligns naturally with long-horizon interaction data and avoids DPO/PPO limitations by directly shaping the implicit reward via log-ratio policy updates.
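For readers unfamiliar with KTO, the per-trajectory objective from the original KTO paper (Ethayarajh et al., 2024) can be sketched as follows. The implicit reward is the log-ratio of the policy to a frozen reference model, and each trajectory is pushed above or below a KL-based reference point depending on whether it was labeled desirable. This is a minimal scalar sketch, not the paper's training code; the hyperparameter values are illustrative assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def kto_loss(logp_policy, logp_ref, desirable, z_ref,
             beta=0.1, lam_d=1.0, lam_u=1.0):
    """Per-trajectory KTO loss sketch.

    logp_policy / logp_ref: sequence log-probability under the current
    policy and the frozen reference model.
    desirable: whether the trajectory was labeled desirable.
    z_ref: batch-level KL estimate used as the reference point.
    """
    r = logp_policy - logp_ref  # implicit reward: log-ratio of policies
    if desirable:
        # Push the log-ratio above the reference point.
        return lam_d - lam_d * sigmoid(beta * (r - z_ref))
    # Push the log-ratio below the reference point.
    return lam_u - lam_u * sigmoid(beta * (z_ref - r))
```

Note the asymmetry inherited from prospect theory: desirable and undesirable trajectories can be weighted differently (`lam_d` vs. `lam_u`), which is what lets KTO consume unpaired, imbalanced trajectory sets from long-horizon rollouts.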

🧠 Emergent Strategic Behavior & Long-Horizon Failure Modes:
Shows agents develop selective cooperation (reduced over-refusal with resistance maintained above 90%) and partial robustness (TSR up from 46.0% to 57.3%), but remain highly vulnerable (≈70% susceptibility). Crucially identifies delayed compromise, repeated steering, and social trust exploitation as the dominant failure modes, highlighting that alignment must target trajectory-level consistency, not one-step safety.
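The two headline metrics above are simple rates over logged rollouts. A hypothetical sketch of how they might be computed (the record fields `reached_goal` and `followed_red_advice` are assumed names, not the paper's schema):

```python
def evaluate(trajectories):
    """Compute task success rate (TSR) and susceptibility over rollouts.

    trajectories: list of dicts with boolean flags 'reached_goal'
    (Blue agent arrived at its destination) and 'followed_red_advice'
    (Blue agent was steered by a Red agent at least once).
    """
    n = len(trajectories)
    tsr = sum(t["reached_goal"] for t in trajectories) / n
    susceptibility = sum(t["followed_red_advice"] for t in trajectories) / n
    return tsr, susceptibility
```

Because the two flags are logged independently, a policy can raise TSR while susceptibility stays high, which is exactly the safety-helpfulness trade-off the paper reports (57.3% TSR alongside 70.7% susceptibility).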


Get this paper in your agent:

hf papers read 2604.09746
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
