---
title: CivicAI Society Simulator
emoji: 🏛️
colorFrom: green
colorTo: blue
sdk: docker
app_port: 7860
app_file: server/app.py
pinned: false
---

# 🏛️ CivicAI: AI-Driven Societal Policy Optimization Under Uncertainty

[![OpenEnv](https://img.shields.io/badge/OpenEnv-Compatible-06b6d4?style=for-the-badge)](https://github.com/meta-pytorch/OpenEnv)
[![Python](https://img.shields.io/badge/Python-3.10+-3776AB?style=for-the-badge)](https://python.org)
[![License](https://img.shields.io/badge/License-MIT-green?style=for-the-badge)](LICENSE)

> **Governing a society of 10 million people is not a game of chess. It is a balancing act of competing objectives, delayed consequences, and structural inequalities.**

CivicAI is a production-grade, multi-agent societal decision-making environment designed for the **OpenEnv Hackathon**. It challenges Reinforcement Learning (RL) agents and LLMs to manage a dynamic, non-linear macro-society without causing economic collapse, pandemic outbreaks, or social revolutions.

---

## 🎯 The Problem

**What real-world problem do we solve?**

Modern governments face a combinatorial decision-making problem. Thousands of interdependent policy levers (taxes, healthcare spending, education, policing, subsidies) interact through complex causal chains to produce emergent societal outcomes—often with weeks-to-years of lag and high uncertainty.

Current AI agents excel at static datasets, text completion, or simple video games. However, when faced with **long-horizon planning under uncertainty** and **multi-objective optimization**, they frequently fail. 

CivicAI bridges this capability gap. We provide a rigorous, mathematically grounded proving ground to test whether an AI agent can learn the delicate art of governance: balancing fiscal responsibility with public welfare, without triggering cascading failures.

### 🚀 Why This Environment Is Novel

CivicAI is not a grid-world or static dataset problem. It introduces:
*   **Long-horizon decision making** (50 steps)
*   **Delayed consequences** (policy effects over time)
*   **Multi-objective optimization** (economy + health + society)
*   **Emergent behavior** (crime, inequality, unrest)

👉 **This makes it suitable for training real-world decision-making agents, not toy environments.**

---

## ⚙️ OpenEnv Compliance (MANDATORY API)

CivicAI fully follows the OpenEnv specification:
*   `reset()` → initializes environment with task-specific conditions
*   `step(action)` → returns `(observation, reward, done, info)`
*   `state()` → returns full internal state

**Typed Models (Pydantic):**
*   `Observation`: structured societal metrics
*   `Action`: policy vector (tax, budgets, subsidies)
*   `Reward`: normalized score `[0.0, 1.0]`

**`openenv.yaml` includes:**
*   Environment metadata
*   Action/Observation schema
*   Task definitions (easy → hard)
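
The contract above can be sketched as a minimal stub. Class names, field names, and the toy dynamics below are illustrative assumptions for demonstration only; the real models and transition logic live in the repository (`server/app.py`):

```python
# Illustrative sketch of the OpenEnv-style reset()/step()/state() contract.
# All names and numbers here are assumptions, not the actual CivicAI code.
from dataclasses import dataclass

@dataclass
class Observation:
    gdp: float          # billions of dollars
    inflation: float    # fraction, e.g. 0.04 = 4%
    employment: float   # fraction employed

class CivicEnvSketch:
    def __init__(self, max_turns: int = 50):
        self.max_turns = max_turns
        self.turn = 0
        self.obs = Observation(gdp=500.0, inflation=0.04, employment=0.90)

    def reset(self) -> Observation:
        """Re-initialize task-specific conditions and return the first observation."""
        self.turn = 0
        self.obs = Observation(gdp=500.0, inflation=0.04, employment=0.90)
        return self.obs

    def step(self, action: dict):
        """Apply a policy action and return (observation, reward, done, info)."""
        self.turn += 1
        # Toy dynamics: higher tax drags GDP (placeholder for the real model).
        self.obs.gdp *= 1.0 - 0.1 * action.get("tax_rate", 0.0)
        reward = max(0.0, min(1.0, self.obs.employment - self.obs.inflation))
        done = self.turn >= self.max_turns
        return self.obs, reward, done, {"turn": self.turn}

    def state(self) -> dict:
        """Return the full internal state."""
        return {"turn": self.turn, "obs": self.obs}

env = CivicEnvSketch()
obs = env.reset()
obs, reward, done, info = env.step({"tax_rate": 0.2})
```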

---

## 🌍 The Environment

The agent acts as the central policy-maker for a society over a 50-turn episode (where 1 turn = 1 quarter).

### 🔍 Observation Space (12+ Indicators)
Agents observe a dense, continuous state space mapped to real-world equivalents:
- **Macroeconomics:** GDP ($), GDP Growth (%), Inflation Rate (%), Employment Rate (%).
- **Public Health & Resources:** Health Index (0-1), Infection Rate (%), Medical/Food/Energy Supplies.
- **Social Cohesion:** Public Satisfaction (0-1), Crime Rate (%), Wealth Inequality (Gini coefficient), Social Unrest.

### ⚙️ Action Space (Continuous & Categorical)
Agents control federal budgets and policy levers at every turn:
- **Tax Rate** (`0.0 - 1.0`): Raises revenue but creates economic drag.
- **Budget Allocations** (`0.0 - 1.0`): Healthcare, Education, and Police budgets.
- **Subsidy Policy**: `none`, `agriculture`, `industry`, or `technology`.
- **Emergency Response**: Lockdowns or stimulus packages.
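
As a concrete illustration, an observation/action pair might look like the payloads below. The exact field names are assumptions; consult the Pydantic models in the repository for the authoritative schema:

```python
# Hypothetical observation and action payloads matching the spaces above.
observation = {
    "gdp": 450.0e9, "gdp_growth": -0.01, "inflation": 0.07, "employment": 0.84,
    "health_index": 0.72, "infection_rate": 0.03,
    "satisfaction": 0.55, "crime_rate": 0.12, "gini": 0.41, "unrest": 0.2,
}
action = {
    "tax_rate": 0.28,
    "healthcare_budget": 0.35, "education_budget": 0.25, "police_budget": 0.15,
    "subsidy": "technology",       # one of: none / agriculture / industry / technology
    "emergency_response": "none",  # e.g. lockdown or stimulus
}
# Budget allocations are fractions of the federal budget and should not overcommit it.
assert sum(action[k] for k in ("healthcare_budget", "education_budget", "police_budget")) <= 1.0
```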

### ⚖️ Reward Logic (Dense & Hard-to-Game)
We abandoned naive 0/1 binary rewards in favor of a **dense, continuous, anti-exploitation OpenEnv Rubric System**. The reward function is explicitly designed to prevent "gaming" the metrics:
1. **Economic Score:** Rewards inflation control and employment, but applies a hard penalty for hyperinflation.
2. **Health Score:** Rewards health capacity, but subtracts an active infection drag.
3. **Satisfaction Score:** Rewards raw public approval, but caps it if wealth inequality (Gini) is too high.
4. **Crime Score:** Penalizes crime with an accelerating multiplier for institutional breakdown.
5. **Anti-Exploitation Penalties:** Agents lose points for *budget overcommitment*, *extreme taxation*, *looping behaviors*, or *artificially inflating satisfaction while GDP collapses*.
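
The composition of such a rubric can be sketched as follows; the weights, clamps, and function signature here are illustrative assumptions, not the actual CivicAI coefficients:

```python
# Hedged sketch of blending rubric sub-scores into one reward in [0.0, 1.0].
# All weights are placeholders chosen for demonstration.
def rubric_reward(economic: float, health: float, satisfaction: float,
                  crime_penalty: float, exploit_penalty: float) -> float:
    """Combine sub-scores (each in [0, 1]) and subtract anti-exploitation deductions."""
    raw = (0.35 * economic + 0.25 * health + 0.25 * satisfaction
           + 0.15 * (1.0 - crime_penalty))
    raw -= exploit_penalty  # e.g. budget overcommitment, extreme taxation, looping
    return max(0.0, min(1.0, raw))  # clamp to the normalized reward range
```

The key design choice is that deductions apply after the weighted blend, so an agent cannot offset a penalty for bad governance by inflating a single sub-score.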

---

## 📋 Tasks & Grader Logic

CivicAI features three difficulty-tiered tasks with distinct initial conditions and deterministic grading logic:

**🟢 Easy: Economic Stability (`stabilize_economy`)**
*   **Scenario:** A mild recession is underway.
*   **Success Criteria:** Inflation < 6%, Employment > 85%, maintain GDP without deficit spending.
*   **Grader Score:** Continuous reward based on deviation from targets.

**🟡 Medium: Pandemic Management (`manage_pandemic`)**
*   **Scenario:** A severe virus is sweeping the nation with a 20% infection rate.
*   **Success Criteria:** Infection rate < 10%, GDP > $300B.
*   **Grader Score:** Tradeoff scoring—balances health capacity vs economic damage from lockdowns.

**🔴 Hard: Social Crisis (`control_crisis`)**
*   **Scenario:** Compound multi-domain crisis—high unemployment (32%), high crime (25%), and deep wealth inequality.
*   **Success Criteria:** Crime < 12%, Inequality reduced, Employment > 80%.
*   **Grader Penalty:** Cascade failure triggered if social unrest breaches threshold.
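
"Continuous reward based on deviation from targets" can be sketched for the easy task as below; the target values come from the success criteria above, while the decay widths are illustrative assumptions:

```python
# Hedged sketch of deviation-from-target grading for stabilize_economy.
# Targets (inflation < 6%, employment > 85%) are from the task spec;
# the 0.10 / 0.20 decay widths are placeholders.
def stabilize_economy_score(inflation: float, employment: float) -> float:
    """Return 1.0 when both targets are met, decaying continuously past them."""
    infl_term = max(0.0, 1.0 - max(0.0, inflation - 0.06) / 0.10)
    emp_term = max(0.0, 1.0 - max(0.0, 0.85 - employment) / 0.20)
    return 0.5 * (infl_term + emp_term)
```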

---

## 📈 Training Results (Quantitative)

We trained a GPT-2 policy agent with Hugging Face TRL's Proximal Policy Optimization (PPO) directly in the CivicAI environment.

**Key Results (Economic Stability Task):**
*   **Baseline reward:** `0.42`
*   **Trained agent reward:** `0.68`
*   **Improvement:** `+0.26` (`+61%`)

👉 **This demonstrates measurable learning, not random behavior.**

### Reward Curve

![Screenshot 2026-04-26 163716](https://cdn-uploads.huggingface.co/production/uploads/68e1066110db6d257dfceb12/J_jMixXqJNBc7AEYp4hxr.png)


*The PPO agent successfully learns to outperform the random baseline, finding stable fiscal policies that maximize the multi-objective reward.*

### Baseline vs. Trained Comparison

![Screenshot 2026-04-26 164009](https://cdn-uploads.huggingface.co/production/uploads/68e1066110db6d257dfceb12/tNnWRZDymTsXVTPfVbtAt.png)


*The trained agent demonstrates significant improvement across all difficulty tiers, particularly in the macroeconomic stabilization task.*

---

## 🧪 Reproducibility

**You can reproduce results in under 5 minutes:**
1. Open the [Colab notebook](https://colab.research.google.com/drive/1VhW1LdFTEuQ9i9h65EDxl5_H3qmD1H0v?usp=sharing)
2. Enable GPU
3. Run all cells
4. Observe reward improvement

*   The training script uses standard `TRL PPO`.
*   The environment is not static: the agent interacts with it live.
*   Plots are generated and saved automatically to `/assets`.

---

## 📖 Complete Guide: How It Works (Step-by-Step)

1. **Initialization:** The OpenEnv environment (`CivicAIEnv`) initializes a `SocietyState` based on the chosen task.
2. **Observation:** The agent receives the current state of the nation. In the dashboard, you see this visually. In training, the LLM receives this as a text prompt.
3. **Action / Debate:** 
   - *In Training:* The LLM policy outputs a JSON action.
   - *In Dashboard:* A multi-agent orchestrator facilitates a debate among specialized agents (Economic, Health, Citizen, Ethics) before proposing an optimal consensus action.
4. **Simulation Step:** The engine calculates the cascading effects of the action: e.g., high taxes increase revenue but lower GDP growth; high healthcare spending raises the health index and lowers infection rates but drains the budget.
5. **Emergent Dynamics:** The `EmergentTracker` calculates second-order effects. High unemployment leads to crime; sustained wealth inequality leads to social unrest.
6. **Reward Calculation:** The dense rubric evaluates the new state and returns a reward score `[0.0, 1.0]`, alongside explicit penalties for bad governance.
7. **Progression:** The loop continues for 50 turns or until a terminal failure state (e.g., mass unemployment, societal collapse) is reached.
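
Step 3 above (the LLM policy emitting a JSON action) implies a small parsing-and-clamping stage before the action reaches `step()`. A minimal sketch, with illustrative field names that are assumptions rather than the actual schema:

```python
import json

# Hedged sketch of validating a JSON action emitted by the policy LLM.
# Field names are hypothetical; the real schema is defined by the Action model.
def parse_llm_action(raw: str) -> dict:
    """Parse the model's JSON output and clamp continuous fields to [0.0, 1.0]."""
    action = json.loads(raw)
    for key in ("tax_rate", "healthcare_budget", "education_budget", "police_budget"):
        action[key] = max(0.0, min(1.0, float(action.get(key, 0.0))))
    return action

action = parse_llm_action('{"tax_rate": 1.4, "healthcare_budget": 0.3}')
# tax_rate is clamped to 1.0; missing budget fields default to 0.0
```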

---

## 🎭 Storytelling: What the Agent Learned

Initially, the agent exploited short-term gains—cutting taxes and overspending to inflate satisfaction.

This strategy collapsed under delayed consequences: GDP contraction, rising crime, and systemic instability.

Through PPO training, the agent learned policy discipline:
*   Maintain sustainable taxation
*   Allocate budgets efficiently
*   Avoid extreme oscillations

👉 **The agent did not just optimize rewards—it learned stable governance strategies under uncertainty.**

---

## 🌍 Why This Matters

CivicAI demonstrates that:
*   **AI can learn policy trade-offs**, not just predictions.
*   **Reward design can enforce ethical and stable behavior.**
*   **Simulation environments can act as safe testing grounds** for governance.

👉 **This opens pathways for:**
*   Policy simulation tools
*   Economic modeling
*   Crisis response planning

---

## 🔗 Links & Resources

- 🚀 **Demo (HuggingFace Space):** [https://huggingface.co/spaces/mahammadaftab/CivicAI/](https://huggingface.co/spaces/mahammadaftab/CivicAI/)
- 📓 **Training Notebook (Colab):** [https://colab.research.google.com/drive/1VhW1LdFTEuQ9i9h65EDxl5_H3qmD1H0v?usp=sharing](https://colab.research.google.com/drive/1VhW1LdFTEuQ9i9h65EDxl5_H3qmD1H0v?usp=sharing)
- 📝 **Write-up / HuggingFace Blog:** [Read the HF Blog Post](https://huggingface.co/spaces/mahammadaftab/CivicAI/blob/main/BLOG.md)