---
title: CivicAI Society Simulator
emoji: 🏛️
colorFrom: green
colorTo: blue
sdk: docker
app_port: 7860
app_file: server/app.py
pinned: false
---

# 🏛️ CivicAI: AI-Driven Societal Policy Optimization Under Uncertainty

[![OpenEnv](https://img.shields.io/badge/OpenEnv-Compatible-06b6d4?style=for-the-badge)](https://github.com/meta-pytorch/OpenEnv)
[![Python](https://img.shields.io/badge/Python-3.10+-3776AB?style=for-the-badge)](https://python.org)
[![License](https://img.shields.io/badge/License-MIT-green?style=for-the-badge)](LICENSE)

> **Governing a society of 10 million people is not a game of chess. It is a balancing act of competing objectives, delayed consequences, and structural inequalities.**

CivicAI is a production-grade, multi-agent societal decision-making environment designed for the **OpenEnv Hackathon**. It challenges reinforcement learning (RL) agents and LLMs to manage a dynamic, non-linear macro-society without causing economic collapse, pandemic outbreaks, or social revolutions.

---

## 🎯 The Problem

**What real-world problem do we solve?**

Modern governments face a combinatorial decision-making problem. Thousands of interdependent policy levers (taxes, healthcare spending, education, policing, subsidies) interact through complex causal chains to produce emergent societal outcomes, often with weeks-to-years of lag and high uncertainty.

Current AI agents excel at static datasets, text completion, or simple video games. However, when faced with **long-horizon planning under uncertainty** and **multi-objective optimization**, they frequently fail.

CivicAI bridges this capability gap. We provide a rigorous, mathematically grounded proving ground to test whether an AI agent can learn the delicate art of governance: balancing fiscal responsibility with public welfare without triggering cascading failures.

### 🚀 Why This Environment Is Novel

CivicAI is not a grid-world or static-dataset problem.
It introduces:

* **Long-horizon decision making** (50 steps)
* **Delayed consequences** (policy effects unfold over time)
* **Multi-objective optimization** (economy + health + society)
* **Emergent behavior** (crime, inequality, unrest)

👉 **This makes it suitable for training real-world decision-making agents, not toy environments.**

---

## ⚙️ OpenEnv Compliance (MANDATORY API)

CivicAI fully follows the OpenEnv specification:

* `reset()` → initializes the environment with task-specific conditions
* `step(action)` → returns `(observation, reward, done, info)`
* `state()` → returns the full internal state

**Typed Models (Pydantic):**

* `Observation`: structured societal metrics
* `Action`: policy vector (tax, budgets, subsidies)
* `Reward`: normalized score in `[0.0, 1.0]`

**`openenv.yaml` includes:**

* Environment metadata
* Action/Observation schema
* Task definitions (easy → hard)

---

## 🌍 The Environment

The agent acts as the central policy-maker for a society over a 50-turn episode (1 turn = 1 quarter).

### 🔍 Observation Space (12+ Indicators)

Agents observe a dense, continuous state space mapped to real-world equivalents:

- **Macroeconomics:** GDP ($), GDP growth (%), inflation rate (%), employment rate (%).
- **Public Health & Resources:** health index (0-1), infection rate (%), medical/food/energy supplies.
- **Social Cohesion:** public satisfaction (0-1), crime rate (%), wealth inequality (Gini coefficient), social unrest.

### ⚙️ Action Space (Continuous & Categorical)

Agents control federal budgets and policy levers at every turn:

- **Tax Rate** (`0.0 - 1.0`): raises revenue but creates economic drag.
- **Budget Allocations** (`0.0 - 1.0`): healthcare, education, and police budgets.
- **Subsidy Policy:** `none`, `agriculture`, `industry`, or `technology`.
- **Emergency Response:** lockdowns or stimulus packages.
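To make the `reset()` / `step(action)` / `state()` contract concrete, here is a minimal toy stand-in that follows the same interface. Every class, field, and coefficient below (`ToyCivicEnv`, the simplified `Action`/`Observation`, the toy dynamics) is an illustrative assumption, not the actual CivicAI implementation, which uses richer Pydantic models and a full simulation engine.

```python
from dataclasses import dataclass

@dataclass
class Action:
    """Illustrative policy vector; the real CivicAI Action is a Pydantic model."""
    tax_rate: float = 0.30           # 0.0 - 1.0
    healthcare_budget: float = 0.30  # 0.0 - 1.0
    subsidy: str = "none"            # none | agriculture | industry | technology

@dataclass
class Observation:
    """Illustrative subset of the 12+ societal indicators."""
    gdp: float
    inflation: float
    employment: float

class ToyCivicEnv:
    """Minimal stand-in following the OpenEnv reset()/step()/state() contract."""

    def reset(self) -> Observation:
        self.turn = 0
        self.obs = Observation(gdp=400.0, inflation=0.06, employment=0.86)
        return self.obs

    def step(self, action: Action):
        # Clamp the continuous lever into its legal [0, 1] range.
        tax = min(1.0, max(0.0, action.tax_rate))
        self.turn += 1
        # Toy dynamics: taxation cools inflation but drags employment.
        self.obs.inflation = max(0.0, self.obs.inflation - 0.01 * tax)
        self.obs.employment = max(0.0, min(1.0, self.obs.employment - 0.02 * (tax - 0.3)))
        reward = max(0.0, min(1.0, self.obs.employment - self.obs.inflation))
        done = self.turn >= 50  # 50-turn episode
        return self.obs, reward, done, {"turn": self.turn}

    def state(self) -> dict:
        return {"turn": self.turn, "observation": self.obs}
```

A full episode is then the usual loop: call `env.reset()` once, then `env.step(action)` repeatedly until `done` is true.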
### โš–๏ธ Reward Logic (Dense & Hard-to-Game) We abandoned naive 0/1 binary rewards for a **highly continuous, anti-exploitation OpenEnv Rubric System**. The reward function is explicitly designed to prevent "gaming" the metrics: 1. **Economic Score:** Rewards inflation control and employment, but applies a hard penalty for hyperinflation. 2. **Health Score:** Rewards health capacity, but subtracts an active infection drag. 3. **Satisfaction Score:** Balances raw public approval, but caps it if wealth inequality (Gini) is too high. 4. **Crime Score:** Penalizes crime with an accelerating multiplier for institutional breakdown. 5. **Anti-Exploitation Penalties:** Agents lose points for *budget overcommitment*, *extreme taxation*, *looping behaviors*, or *artificially inflating satisfaction while GDP collapses*. --- ## ๐Ÿ“‹ Tasks & Grader Logic CivicAI features three difficulty-tiered tasks with distinct initial conditions and deterministic grading logic: **๐ŸŸข Easy: Economic Stability (`stabilize_economy`)** * **Scenario:** A mild recession is underway. * **Success Criteria:** Inflation < 6%, Employment > 85%, maintain GDP without deficit spending. * **Grader Score:** Continuous reward based on deviation from targets. **๐ŸŸก Medium: Pandemic Management (`manage_pandemic`)** * **Scenario:** A severe virus is sweeping the nation with a 20% infection rate. * **Success Criteria:** Infection rate < 10%, GDP > $300B. * **Grader Score:** Tradeoff scoringโ€”balances health capacity vs economic damage from lockdowns. **๐Ÿ”ด Hard: Social Crisis (`control_crisis`)** * **Scenario:** Compound multi-domain crisisโ€”high unemployment (32%), high crime (25%), and deep wealth inequality. * **Success Criteria:** Crime < 12%, Inequality reduced, Employment > 80%. * **Grader Penalty:** Cascade failure triggered if social unrest breaches threshold. 
---

## 📈 Training Results (Quantitative)

We trained a GPT-2 policy agent with Hugging Face TRL (Proximal Policy Optimization, PPO) directly in the CivicAI environment.

**Key Results (Economic Stability Task):**

* **Baseline reward:** `0.42`
* **Trained agent reward:** `0.68`
* **Improvement:** `+0.26` (`+61%`)

👉 **This demonstrates measurable learning, not random behavior.**

### Reward Curve

![Training reward curve](https://cdn-uploads.huggingface.co/production/uploads/68e1066110db6d257dfceb12/J_jMixXqJNBc7AEYp4hxr.png)

*The PPO agent learns to outperform the random baseline, finding stable fiscal policies that maximize the multi-objective reward.*

### Baseline vs. Trained Comparison

![Baseline vs. trained comparison chart](https://cdn-uploads.huggingface.co/production/uploads/68e1066110db6d257dfceb12/tNnWRZDymTsXVTPfVbtAt.png)

*The trained agent shows significant improvement across all difficulty tiers, particularly in the macroeconomic stabilization task.*

---

## 🧪 Reproducibility

**You can reproduce the results in under 5 minutes:**

1. Open the [Colab notebook](https://colab.research.google.com/drive/1VhW1LdFTEuQ9i9h65EDxl5_H3qmD1H0v?usp=sharing)
2. Enable GPU
3. Run all cells
4. Observe the reward improvement

* The training script uses standard TRL PPO.
* The environment is not static: the agent interacts with it live.
* Plots are generated and saved automatically to `/assets`.

---

## 📖 Complete Guide: How It Works (Step-by-Step)

1. **Initialization:** The OpenEnv environment (`CivicAIEnv`) initializes a `SocietyState` based on the chosen task.
2. **Observation:** The agent receives the current state of the nation. In the dashboard, you see this visually; in training, the LLM receives it as a text prompt.
3. **Action / Debate:**
   - *In Training:* The LLM policy outputs a JSON action.
   - *In Dashboard:* A multi-agent orchestrator facilitates a debate among specialized agents (Economic, Health, Citizen, Ethics) before proposing a consensus action.
4. **Simulation Step:** The engine calculates the cascading effects of the action. For example, high taxes increase revenue but lower GDP growth; high healthcare spending raises the health index and lowers infection rates but drains the budget.
5. **Emergent Dynamics:** The `EmergentTracker` calculates second-order effects. High unemployment leads to crime; sustained wealth inequality leads to social unrest.
6. **Reward Calculation:** The dense rubric evaluates the new state and returns a reward in `[0.0, 1.0]`, alongside explicit penalties for bad governance.
7. **Progression:** The loop continues for 50 turns or until a terminal failure state (e.g., mass unemployment, societal collapse) is reached.

---

## 🎭 Storytelling: What the Agent Learned

Initially, the agent exploited short-term gains, cutting taxes and overspending to inflate satisfaction. This strategy collapsed under delayed consequences: GDP contraction, rising crime, and systemic instability.

Through PPO training, the agent learned policy discipline:

* Maintain sustainable taxation
* Allocate budgets efficiently
* Avoid extreme oscillations

👉 **The agent did not just optimize rewards; it learned stable governance strategies under uncertainty.**

---

## 🌍 Why This Matters

CivicAI demonstrates that:

* **AI can learn policy trade-offs**, not just predictions.
* **Reward design can enforce ethical and stable behavior.**
* **Simulation environments can act as safe testing grounds** for governance.
👉 **This opens pathways for:**

* Policy simulation tools
* Economic modeling
* Crisis response planning

---

## 🔗 Links & Resources

- 🚀 **Demo (Hugging Face Space):** [https://huggingface.co/spaces/mahammadaftab/CivicAI/](https://huggingface.co/spaces/mahammadaftab/CivicAI/)
- 📓 **Training Notebook (Colab):** [https://colab.research.google.com/drive/1VhW1LdFTEuQ9i9h65EDxl5_H3qmD1H0v?usp=sharing](https://colab.research.google.com/drive/1VhW1LdFTEuQ9i9h65EDxl5_H3qmD1H0v?usp=sharing)
- 📝 **Write-up (Hugging Face Blog):** [Read the HF blog post](https://huggingface.co/spaces/mahammadaftab/CivicAI/blob/main/BLOG.md)