---
title: CivicAI Society Simulator
emoji: 🏛️
colorFrom: green
colorTo: blue
sdk: docker
app_port: 7860
app_file: server/app.py
pinned: false
---
# 🏛️ CivicAI: AI-Driven Societal Policy Optimization Under Uncertainty
[![OpenEnv](https://img.shields.io/badge/OpenEnv-Compatible-06b6d4?style=for-the-badge)](https://github.com/meta-pytorch/OpenEnv)
[![Python](https://img.shields.io/badge/Python-3.10+-3776AB?style=for-the-badge)](https://python.org)
[![License](https://img.shields.io/badge/License-MIT-green?style=for-the-badge)](LICENSE)
> **Governing a society of 10 million people is not a game of chess. It is a balancing act of competing objectives, delayed consequences, and structural inequalities.**
CivicAI is a production-grade, multi-agent societal decision-making environment designed for the **OpenEnv Hackathon**. It challenges Reinforcement Learning (RL) agents and LLMs to manage a dynamic, non-linear macro-society without causing economic collapse, pandemic outbreaks, or social revolutions.
---
## 🎯 The Problem
**What real-world problem do we solve?**
Modern governments face a combinatorial decision-making problem. Thousands of interdependent policy levers (taxes, healthcare spending, education, policing, subsidies) interact through complex causal chains to produce emergent societal outcomes—often with weeks-to-years of lag and high uncertainty.
Current AI agents excel at static datasets, text completion, or simple video games. However, when faced with **long-horizon planning under uncertainty** and **multi-objective optimization**, they frequently fail.
CivicAI bridges this capability gap. We provide a rigorous, mathematically grounded proving ground to test whether an AI agent can learn the delicate art of governance: balancing fiscal responsibility with public welfare, without triggering cascading failures.
### 🚀 Why This Environment Is Novel
CivicAI is not a grid-world or static dataset problem. It introduces:
* **Long-horizon decision making** (50 steps)
* **Delayed consequences** (policy effects over time)
* **Multi-objective optimization** (economy + health + society)
* **Emergent behavior** (crime, inequality, unrest)
👉 **This makes it suitable for training real-world decision-making agents, not toy environments.**
---
## ⚙️ OpenEnv Compliance (MANDATORY API)
CivicAI fully follows the OpenEnv specification:
* `reset()` → initializes environment with task-specific conditions
* `step(action)` → returns `(observation, reward, done, info)`
* `state()` → returns full internal state
**Typed Models (Pydantic):**
* `Observation`: structured societal metrics
* `Action`: policy vector (tax, budgets, subsidies)
* `Reward`: normalized score `[0.0 – 1.0]`
**`openenv.yaml` includes:**
* Environment metadata
* Action/Observation schema
* Task definitions (easy → hard)
---
## 🌍 The Environment
The agent acts as the central policy-maker for a society over a 50-turn episode (where 1 turn = 1 quarter).
### 🔍 Observation Space (12+ Indicators)
Agents observe a dense, continuous state space mapped to real-world equivalents:
- **Macroeconomics:** GDP ($), GDP Growth (%), Inflation Rate (%), Employment Rate (%).
- **Public Health & Resources:** Health Index (0-1), Infection Rate (%), Medical/Food/Energy Supplies.
- **Social Cohesion:** Public Satisfaction (0-1), Crime Rate (%), Wealth Inequality (Gini coefficient), Social Unrest.
### ⚙️ Action Space (Continuous & Categorical)
Agents control federal budgets and policy levers at every turn:
- **Tax Rate** (`0.0 - 1.0`): Raises revenue but creates economic drag.
- **Budget Allocations** (`0.0 - 1.0`): Healthcare, Education, and Police budgets.
- **Subsidy Policy**: `none`, `agriculture`, `industry`, or `technology`.
- **Emergency Response**: Lockdowns or stimulus packages.
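A hypothetical action payload might look like the following; the field names are assumptions derived from the levers above, not the environment's actual schema.

```python
# Hypothetical action payload; field names are illustrative assumptions.
action = {
    "tax_rate": 0.32,           # 0.0 - 1.0: raises revenue, adds economic drag
    "healthcare_budget": 0.25,  # 0.0 - 1.0 budget allocations
    "education_budget": 0.20,
    "police_budget": 0.10,
    "subsidy": "technology",    # none | agriculture | industry | technology
    "emergency": None,          # e.g. "lockdown" or "stimulus"
}
```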
### ⚖️ Reward Logic (Dense & Hard-to-Game)
We abandoned naive 0/1 binary rewards in favor of a **dense, continuous, anti-exploitation rubric system**. The reward function is explicitly designed to prevent agents from gaming the metrics:
1. **Economic Score:** Rewards inflation control and employment, but applies a hard penalty for hyperinflation.
2. **Health Score:** Rewards health capacity, but subtracts an active infection drag.
3. **Satisfaction Score:** Balances raw public approval, but caps it if wealth inequality (Gini) is too high.
4. **Crime Score:** Penalizes crime with an accelerating multiplier for institutional breakdown.
5. **Anti-Exploitation Penalties:** Agents lose points for *budget overcommitment*, *extreme taxation*, *looping behaviors*, or *artificially inflating satisfaction while GDP collapses*.
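The overall shape of such a rubric can be sketched as a weighted sum of sub-scores minus explicit penalties, clipped to `[0.0, 1.0]`. The weights and penalty magnitudes below are illustrative assumptions, not the environment's actual coefficients.

```python
# Hedged sketch of the rubric's shape; weights are illustrative assumptions.

def rubric_reward(economic, health, satisfaction, crime, penalties):
    """Combine sub-scores (each in [0, 1]) and subtract anti-exploitation
    penalties, clipping the final reward to [0.0, 1.0]."""
    base = 0.25 * (economic + health + satisfaction + crime)
    return max(0.0, min(1.0, base - sum(penalties)))
```

Clipping at zero matters: a collapsing society with large penalties should bottom out rather than go negative, keeping the reward on the documented `[0.0, 1.0]` scale.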
---
## 📋 Tasks & Grader Logic
CivicAI features three difficulty-tiered tasks with distinct initial conditions and deterministic grading logic:
**🟢 Easy: Economic Stability (`stabilize_economy`)**
* **Scenario:** A mild recession is underway.
* **Success Criteria:** Inflation < 6%, Employment > 85%, maintain GDP without deficit spending.
* **Grader Score:** Continuous reward based on deviation from targets.
**🟡 Medium: Pandemic Management (`manage_pandemic`)**
* **Scenario:** A severe virus is sweeping the nation with a 20% infection rate.
* **Success Criteria:** Infection rate < 10%, GDP > $300B.
* **Grader Score:** Tradeoff scoring that balances health capacity against the economic damage of lockdowns.
**🔴 Hard: Social Crisis (`control_crisis`)**
* **Scenario:** Compound multi-domain crisis—high unemployment (32%), high crime (25%), and deep wealth inequality.
* **Success Criteria:** Crime < 12%, Inequality reduced, Employment > 80%.
* **Grader Penalty:** Cascade failure triggered if social unrest breaches threshold.
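To illustrate how a deterministic grader can encode the easy task's success criteria, here is a minimal sketch; the thresholds come from the description above, while the metric keys are assumptions.

```python
# Illustrative grader check for the easy task; metric keys are assumptions.

def stabilize_economy_success(state):
    """True when the stated stabilize_economy criteria are all met."""
    return (
        state["inflation"] < 0.06        # inflation below 6%
        and state["employment"] > 0.85   # employment above 85%
        and state["deficit"] <= 0.0      # no deficit spending
    )
```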
---
## 📈 Training Results (Quantitative)
We trained a GPT-2 policy agent with Hugging Face TRL using Proximal Policy Optimization (PPO), directly in the CivicAI environment.
**Key Results (Economic Stability Task):**
* **Baseline reward:** `0.42`
* **Trained agent reward:** `0.68`
* **Improvement:** `+0.26` (≈ `+62%` relative to baseline)
👉 **This demonstrates measurable learning, not random behavior.**
### Reward Curve
![Screenshot 2026-04-26 163716](https://cdn-uploads.huggingface.co/production/uploads/68e1066110db6d257dfceb12/J_jMixXqJNBc7AEYp4hxr.png)
*The PPO agent successfully learns to outperform the random baseline, finding stable fiscal policies that maximize the multi-objective reward.*
### Baseline vs. Trained Comparison
![Screenshot 2026-04-26 164009](https://cdn-uploads.huggingface.co/production/uploads/68e1066110db6d257dfceb12/tNnWRZDymTsXVTPfVbtAt.png)
*The trained agent demonstrates significant improvement across all difficulty tiers, particularly in the macroeconomic stabilization task.*
---
## 🧪 Reproducibility
**You can reproduce results in under 5 minutes:**
1. Open the [Colab notebook](https://colab.research.google.com/drive/1VhW1LdFTEuQ9i9h65EDxl5_H3qmD1H0v?usp=sharing)
2. Enable GPU
3. Run all cells
4. Observe reward improvement
* The training script uses standard `TRL PPO`.
* The environment is not static; the agent interacts with it live.
* Plots are generated and saved automatically to `/assets`.
---
## 📖 Complete Guide: How It Works (Step-by-Step)
1. **Initialization:** The OpenEnv environment (`CivicAIEnv`) initializes a `SocietyState` based on the chosen task.
2. **Observation:** The agent receives the current state of the nation. In the dashboard, you see this visually. In training, the LLM receives this as a text prompt.
3. **Action / Debate:**
- *In Training:* The LLM policy outputs a JSON action.
- *In Dashboard:* A multi-agent orchestrator facilitates a debate among specialized agents (Economic, Health, Citizen, Ethics) before proposing an optimal consensus action.
4. **Simulation Step:** The engine calculates the cascading effects of the action, e.g., high taxes increase revenue but slow GDP growth; heavy healthcare spending raises the health index and lowers infection rates but drains the budget.
5. **Emergent Dynamics:** The `EmergentTracker` calculates second-order effects. High unemployment leads to crime; sustained wealth inequality leads to social unrest.
6. **Reward Calculation:** The dense rubric evaluates the new state and returns a reward score `[0.0, 1.0]`, alongside explicit penalties for bad governance.
7. **Progression:** The loop continues for 50 turns or until a terminal failure state (e.g., mass unemployment, societal collapse) is reached.
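The second-order effects in step 5 can be sketched as simple threshold-driven updates. The actual `EmergentTracker` coefficients are not documented here, so the numbers below are illustrative assumptions only.

```python
# Hedged sketch of step 5's emergent dynamics; coefficients and thresholds
# are illustrative assumptions, not the EmergentTracker's actual values.

def emergent_update(crime, unrest, unemployment, gini):
    """High unemployment feeds crime; sustained inequality feeds unrest."""
    crime += 0.3 * max(0.0, unemployment - 0.10)   # kicks in past ~10% unemployment
    unrest += 0.2 * max(0.0, gini - 0.45)          # kicks in past Gini ~0.45
    return min(crime, 1.0), min(unrest, 1.0)
```

Threshold-gated updates like this produce the delayed, non-linear consequences the environment is built around: an agent can ignore unemployment for a while, then watch crime compound once the threshold is crossed.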
---
## 🎭 Storytelling: What the Agent Learned
Initially, the agent exploited short-term gains—cutting taxes and overspending to inflate satisfaction.
This strategy collapsed under delayed consequences: GDP contraction, rising crime, and systemic instability.
Through PPO training, the agent learned policy discipline:
* Maintain sustainable taxation
* Allocate budgets efficiently
* Avoid extreme oscillations
👉 **The agent did not just optimize rewards—it learned stable governance strategies under uncertainty.**
---
## 🌍 Why This Matters
CivicAI demonstrates that:
* **AI can learn policy trade-offs**, not just predictions.
* **Reward design can enforce ethical and stable behavior.**
* **Simulation environments can act as safe testing grounds** for governance.
👉 **This opens pathways for:**
* Policy simulation tools
* Economic modeling
* Crisis response planning
---
## 🔗 Links & Resources
- 🚀 **Demo (HuggingFace Space):** [https://huggingface.co/spaces/mahammadaftab/CivicAI/](https://huggingface.co/spaces/mahammadaftab/CivicAI/)
- 📓 **Training Notebook (Colab):** [https://colab.research.google.com/drive/1VhW1LdFTEuQ9i9h65EDxl5_H3qmD1H0v?usp=sharing](https://colab.research.google.com/drive/1VhW1LdFTEuQ9i9h65EDxl5_H3qmD1H0v?usp=sharing)
- 📝 **Write-up / HuggingFace Blog:** [Read the HF Blog Post](https://huggingface.co/spaces/mahammadaftab/CivicAI/blob/main/BLOG.md)