---
title: CivicAI Society Simulator
emoji: 🏛️
colorFrom: green
colorTo: blue
sdk: docker
app_port: 7860
app_file: server/app.py
pinned: false
---
# 🏛️ CivicAI: AI-Driven Societal Policy Optimization Under Uncertainty
[![OpenEnv](https://img.shields.io/badge/OpenEnv-Compatible-06b6d4?style=for-the-badge)](https://github.com/meta-pytorch/OpenEnv)
[![Python](https://img.shields.io/badge/Python-3.10+-3776AB?style=for-the-badge)](https://python.org)
[![License](https://img.shields.io/badge/License-MIT-green?style=for-the-badge)](LICENSE)
> **Governing a society of 10 million people is not a game of chess. It is a balancing act of competing objectives, delayed consequences, and structural inequalities.**
CivicAI is a production-grade, multi-agent societal decision-making environment designed for the **OpenEnv Hackathon**. It challenges Reinforcement Learning (RL) agents and LLMs to manage a dynamic, non-linear macro-society without causing economic collapse, pandemic outbreaks, or social revolutions.
---
## 🎯 The Problem
**What real-world problem do we solve?**
Modern governments face a combinatorial decision-making problem. Thousands of interdependent policy levers (taxes, healthcare spending, education, policing, subsidies) interact through complex causal chains to produce emergent societal outcomes—often with weeks-to-years of lag and high uncertainty.
Current AI agents excel at static datasets, text completion, or simple video games. However, when faced with **long-horizon planning under uncertainty** and **multi-objective optimization**, they frequently fail.
CivicAI bridges this capability gap. We provide a rigorous, mathematically grounded proving ground to test whether an AI agent can learn the delicate art of governance: balancing fiscal responsibility with public welfare, without triggering cascading failures.
### 🚀 Why This Environment Is Novel
CivicAI is not a grid-world or static dataset problem. It introduces:
* **Long-horizon decision making** (50 steps)
* **Delayed consequences** (policy effects over time)
* **Multi-objective optimization** (economy + health + society)
* **Emergent behavior** (crime, inequality, unrest)
👉 **This makes it suitable for training real-world decision-making agents, not toy environments.**
---
## ⚙️ OpenEnv Compliance (MANDATORY API)
CivicAI fully follows the OpenEnv specification:
* `reset()` → initializes environment with task-specific conditions
* `step(action)` → returns `(observation, reward, done, info)`
* `state()` → returns full internal state
**Typed Models (Pydantic):**
* `Observation`: structured societal metrics
* `Action`: policy vector (tax, budgets, subsidies)
* `Reward`: normalized score `[0.0 – 1.0]`
**`openenv.yaml` includes:**
* Environment metadata
* Action/Observation schema
* Task definitions (easy → hard)
---
## 🌍 The Environment
The agent acts as the central policy-maker for a society over a 50-turn episode (where 1 turn = 1 quarter).
### 🔍 Observation Space (12+ Indicators)
Agents observe a dense, continuous state space mapped to real-world equivalents:
- **Macroeconomics:** GDP ($), GDP Growth (%), Inflation Rate (%), Employment Rate (%).
- **Public Health & Resources:** Health Index (0-1), Infection Rate (%), Medical/Food/Energy Supplies.
- **Social Cohesion:** Public Satisfaction (0-1), Crime Rate (%), Wealth Inequality (Gini coefficient), Social Unrest.
### ⚙️ Action Space (Continuous & Categorical)
Agents control federal budgets and policy levers at every turn:
- **Tax Rate** (`0.0 - 1.0`): Raises revenue but creates economic drag.
- **Budget Allocations** (`0.0 - 1.0`): Healthcare, Education, and Police budgets.
- **Subsidy Policy**: `none`, `agriculture`, `industry`, or `technology`.
- **Emergency Response**: Lockdowns or stimulus packages.
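A hypothetical action payload might look like the following; the field names are assumptions derived from the levers above, not the environment's actual schema.

```python
# Hypothetical action payload; field names are illustrative assumptions.
action = {
    "tax_rate": 0.32,           # 0.0 - 1.0: raises revenue, adds economic drag
    "healthcare_budget": 0.25,  # 0.0 - 1.0 budget allocations
    "education_budget": 0.20,
    "police_budget": 0.10,
    "subsidy": "technology",    # none | agriculture | industry | technology
    "emergency": None,          # e.g. "lockdown" or "stimulus"
}
```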
### ⚖️ Reward Logic (Dense & Hard-to-Game)
We abandoned naive 0/1 binary rewards in favor of a **dense, continuous, anti-exploitation rubric system**. The reward function is explicitly designed to prevent agents from gaming the metrics:
1. **Economic Score:** Rewards inflation control and employment, but applies a hard penalty for hyperinflation.
2. **Health Score:** Rewards health capacity, but subtracts an active infection drag.
3. **Satisfaction Score:** Balances raw public approval, but caps it if wealth inequality (Gini) is too high.
4. **Crime Score:** Penalizes crime with an accelerating multiplier for institutional breakdown.
5. **Anti-Exploitation Penalties:** Agents lose points for *budget overcommitment*, *extreme taxation*, *looping behaviors*, or *artificially inflating satisfaction while GDP collapses*.
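The overall shape of such a rubric can be sketched as a weighted sum of sub-scores minus explicit penalties, clipped to `[0.0, 1.0]`. The weights and penalty magnitudes below are illustrative assumptions, not the environment's actual coefficients.

```python
# Hedged sketch of the rubric's shape; weights are illustrative assumptions.

def rubric_reward(economic, health, satisfaction, crime, penalties):
    """Combine sub-scores (each in [0, 1]) and subtract anti-exploitation
    penalties, clipping the final reward to [0.0, 1.0]."""
    base = 0.25 * (economic + health + satisfaction + crime)
    return max(0.0, min(1.0, base - sum(penalties)))
```

Clipping at zero matters: a collapsing society with large penalties should bottom out rather than go negative, keeping the reward on the documented `[0.0, 1.0]` scale.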
---
## 📋 Tasks & Grader Logic
CivicAI features three difficulty-tiered tasks with distinct initial conditions and deterministic grading logic:
**🟢 Easy: Economic Stability (`stabilize_economy`)**
* **Scenario:** A mild recession is underway.
* **Success Criteria:** Inflation < 6%, Employment > 85%, maintain GDP without deficit spending.
* **Grader Score:** Continuous reward based on deviation from targets.
**🟡 Medium: Pandemic Management (`manage_pandemic`)**
* **Scenario:** A severe virus is sweeping the nation with a 20% infection rate.
* **Success Criteria:** Infection rate < 10%, GDP > $300B.
* **Grader Score:** Tradeoff scoring that balances health capacity against the economic damage of lockdowns.
**🔴 Hard: Social Crisis (`control_crisis`)**
* **Scenario:** Compound multi-domain crisis—high unemployment (32%), high crime (25%), and deep wealth inequality.
* **Success Criteria:** Crime < 12%, Inequality reduced, Employment > 80%.
* **Grader Penalty:** Cascade failure triggered if social unrest breaches threshold.
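To illustrate how a deterministic grader can encode the easy task's success criteria, here is a minimal sketch; the thresholds come from the description above, while the metric keys are assumptions.

```python
# Illustrative grader check for the easy task; metric keys are assumptions.

def stabilize_economy_success(state):
    """True when the stated stabilize_economy criteria are all met."""
    return (
        state["inflation"] < 0.06        # inflation below 6%
        and state["employment"] > 0.85   # employment above 85%
        and state["deficit"] <= 0.0      # no deficit spending
    )
```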
---
## 📈 Training Results (Quantitative)
We trained a GPT-2 policy agent with Hugging Face TRL using Proximal Policy Optimization (PPO), directly in the CivicAI environment.
**Key Results (Economic Stability Task):**
* **Baseline reward:** `0.42`
* **Trained agent reward:** `0.68`
* **Improvement:** `+0.26` (≈ `+62%` relative to baseline)
👉 **This demonstrates measurable learning, not random behavior.**
### Reward Curve
![Screenshot 2026-04-26 163716](https://cdn-uploads.huggingface.co/production/uploads/68e1066110db6d257dfceb12/J_jMixXqJNBc7AEYp4hxr.png)
*The PPO agent successfully learns to outperform the random baseline, finding stable fiscal policies that maximize the multi-objective reward.*
### Baseline vs. Trained Comparison
![Screenshot 2026-04-26 164009](https://cdn-uploads.huggingface.co/production/uploads/68e1066110db6d257dfceb12/tNnWRZDymTsXVTPfVbtAt.png)
*The trained agent demonstrates significant improvement across all difficulty tiers, particularly in the macroeconomic stabilization task.*
---
## 🧪 Reproducibility
**You can reproduce results in under 5 minutes:**
1. Open the [Colab notebook](https://colab.research.google.com/drive/1VhW1LdFTEuQ9i9h65EDxl5_H3qmD1H0v?usp=sharing)
2. Enable GPU
3. Run all cells
4. Observe reward improvement
* The training script uses standard `TRL PPO`.
* The environment is not static; the agent interacts with it live.
* Plots are generated and saved automatically to `/assets`.
---
## 📖 Complete Guide: How It Works (Step-by-Step)
1. **Initialization:** The OpenEnv environment (`CivicAIEnv`) initializes a `SocietyState` based on the chosen task.
2. **Observation:** The agent receives the current state of the nation. In the dashboard, you see this visually. In training, the LLM receives this as a text prompt.
3. **Action / Debate:**
- *In Training:* The LLM policy outputs a JSON action.
- *In Dashboard:* A multi-agent orchestrator facilitates a debate among specialized agents (Economic, Health, Citizen, Ethics) before proposing an optimal consensus action.
4. **Simulation Step:** The engine calculates the cascading effects of the action, e.g., high taxes increase revenue but slow GDP growth; heavy healthcare spending raises the health index and lowers infection rates but drains the budget.
5. **Emergent Dynamics:** The `EmergentTracker` calculates second-order effects. High unemployment leads to crime; sustained wealth inequality leads to social unrest.
6. **Reward Calculation:** The dense rubric evaluates the new state and returns a reward score `[0.0, 1.0]`, alongside explicit penalties for bad governance.
7. **Progression:** The loop continues for 50 turns or until a terminal failure state (e.g., mass unemployment, societal collapse) is reached.
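The second-order effects in step 5 can be sketched as simple threshold-driven updates. The actual `EmergentTracker` coefficients are not documented here, so the numbers below are illustrative assumptions only.

```python
# Hedged sketch of step 5's emergent dynamics; coefficients and thresholds
# are illustrative assumptions, not the EmergentTracker's actual values.

def emergent_update(crime, unrest, unemployment, gini):
    """High unemployment feeds crime; sustained inequality feeds unrest."""
    crime += 0.3 * max(0.0, unemployment - 0.10)   # kicks in past ~10% unemployment
    unrest += 0.2 * max(0.0, gini - 0.45)          # kicks in past Gini ~0.45
    return min(crime, 1.0), min(unrest, 1.0)
```

Threshold-gated updates like this produce the delayed, non-linear consequences the environment is built around: an agent can ignore unemployment for a while, then watch crime compound once the threshold is crossed.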
---
## 🎭 Storytelling: What the Agent Learned
Initially, the agent exploited short-term gains—cutting taxes and overspending to inflate satisfaction.
This strategy collapsed under delayed consequences: GDP contraction, rising crime, and systemic instability.
Through PPO training, the agent learned policy discipline:
* Maintain sustainable taxation
* Allocate budgets efficiently
* Avoid extreme oscillations
👉 **The agent did not just optimize rewards—it learned stable governance strategies under uncertainty.**
---
## 🌍 Why This Matters
CivicAI demonstrates that:
* **AI can learn policy trade-offs**, not just predictions.
* **Reward design can enforce ethical and stable behavior.**
* **Simulation environments can act as safe testing grounds** for governance.
👉 **This opens pathways for:**
* Policy simulation tools
* Economic modeling
* Crisis response planning
---
## 🔗 Links & Resources
- 🚀 **Demo (HuggingFace Space):** [https://huggingface.co/spaces/mahammadaftab/CivicAI/](https://huggingface.co/spaces/mahammadaftab/CivicAI/)
- 📓 **Training Notebook (Colab):** [https://colab.research.google.com/drive/1VhW1LdFTEuQ9i9h65EDxl5_H3qmD1H0v?usp=sharing](https://colab.research.google.com/drive/1VhW1LdFTEuQ9i9h65EDxl5_H3qmD1H0v?usp=sharing)
- 📝 **Write-up / HuggingFace Blog:** [Read the HF Blog Post](https://huggingface.co/spaces/mahammadaftab/CivicAI/blob/main/BLOG.md)