---
title: CivicAI Society Simulator
emoji: 🏛️
colorFrom: green
colorTo: blue
sdk: docker
app_port: 7860
app_file: server/app.py
pinned: false
---
# 🏛️ CivicAI: AI-Driven Societal Policy Optimization Under Uncertainty

[OpenEnv](https://github.com/meta-pytorch/OpenEnv) · [Python](https://python.org) · [License](LICENSE)
> **Governing a society of 10 million people is not a game of chess. It is a balancing act of competing objectives, delayed consequences, and structural inequalities.**

CivicAI is a production-grade, multi-agent societal decision-making environment designed for the **OpenEnv Hackathon**. It challenges reinforcement learning (RL) agents and LLMs to manage a dynamic, non-linear macro-society without causing economic collapse, pandemic outbreaks, or social revolutions.

---

## 🎯 The Problem

**What real-world problem do we solve?**

Modern governments face a combinatorial decision-making problem. Thousands of interdependent policy levers (taxes, healthcare spending, education, policing, subsidies) interact through complex causal chains to produce emergent societal outcomes, often with lags of weeks to years and high uncertainty.

Current AI agents excel at static datasets, text completion, or simple video games. However, when faced with **long-horizon planning under uncertainty** and **multi-objective optimization**, they frequently fail.

CivicAI bridges this capability gap. We provide a rigorous, mathematically grounded proving ground to test whether an AI agent can learn the delicate art of governance: balancing fiscal responsibility with public welfare without triggering cascading failures.

### 🚀 Why This Environment Is Novel

CivicAI is not a grid-world or static-dataset problem. It introduces:

* **Long-horizon decision making** (50 steps)
* **Delayed consequences** (policy effects unfold over time)
* **Multi-objective optimization** (economy + health + society)
* **Emergent behavior** (crime, inequality, unrest)

👉 **This makes CivicAI a testbed for training real-world decision-making agents, not another toy environment.**

---
## ⚙️ OpenEnv Compliance (MANDATORY API)

CivicAI fully follows the OpenEnv specification:

* `reset()` → initializes the environment with task-specific conditions
* `step(action)` → returns `(observation, reward, done, info)`
* `state()` → returns the full internal state

**Typed Models (Pydantic):**

* `Observation`: structured societal metrics
* `Action`: policy vector (tax, budgets, subsidies)
* `Reward`: normalized score in `[0.0, 1.0]`

**`openenv.yaml` includes:**

* Environment metadata
* Action/Observation schema
* Task definitions (easy → hard)
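The contract above can be sketched as a minimal, self-contained loop. This `CivicAIEnv` is a hypothetical stand-in with toy dynamics and illustrative field names, not the real implementation:

```python
from dataclasses import dataclass

# Minimal sketch of the OpenEnv-style contract described above.
# All dynamics and constants here are illustrative toys.

@dataclass
class Observation:
    gdp: float          # in $B
    inflation: float    # fraction, e.g. 0.05 = 5%
    employment: float   # fraction

@dataclass
class Action:
    tax_rate: float           # 0.0 - 1.0
    healthcare_budget: float  # 0.0 - 1.0

class CivicAIEnv:
    def reset(self, task: str = "stabilize_economy") -> Observation:
        # reset() -> task-specific initial conditions
        self._turn = 0
        self._obs = Observation(gdp=400.0, inflation=0.05, employment=0.88)
        return self._obs

    def step(self, action: Action):
        # step(action) -> (observation, reward, done, info)
        self._turn += 1
        self._obs.inflation -= 0.01 * action.tax_rate  # taxes cool inflation...
        self._obs.gdp *= 1.0 - 0.02 * action.tax_rate  # ...but create drag
        reward = max(0.0, min(1.0, 1.0 - 10.0 * abs(self._obs.inflation)))
        done = self._turn >= 50
        return self._obs, reward, done, {"turn": self._turn}

    def state(self):
        # state() -> full internal state
        return {"turn": self._turn, "obs": self._obs}

env = CivicAIEnv()
obs = env.reset()
obs, reward, done, info = env.step(Action(tax_rate=0.3, healthcare_budget=0.2))
```

The real environment exposes the same three methods; only the typed payloads (Pydantic models, 12+ indicators) are richer.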
---

## 🌍 The Environment

The agent acts as the central policy-maker for a society over a 50-turn episode (1 turn = 1 quarter).

### 🔍 Observation Space (12+ Indicators)

Agents observe a dense, continuous state space mapped to real-world equivalents:

- **Macroeconomics:** GDP ($), GDP Growth (%), Inflation Rate (%), Employment Rate (%).
- **Public Health & Resources:** Health Index (0-1), Infection Rate (%), Medical/Food/Energy Supplies.
- **Social Cohesion:** Public Satisfaction (0-1), Crime Rate (%), Wealth Inequality (Gini coefficient), Social Unrest.

### ⚙️ Action Space (Continuous & Categorical)

Agents control federal budgets and policy levers at every turn:

- **Tax Rate** (`0.0 - 1.0`): Raises revenue but creates economic drag.
- **Budget Allocations** (`0.0 - 1.0`): Healthcare, Education, and Police budgets.
- **Subsidy Policy**: `none`, `agriculture`, `industry`, or `technology`.
- **Emergency Response**: Lockdowns or stimulus packages.
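A typed action matching these levers could look like the sketch below. The repo uses Pydantic models; a plain dataclass shows the same shape, and the field names here are illustrative assumptions:

```python
from dataclasses import dataclass
from enum import Enum

class Subsidy(Enum):
    NONE = "none"
    AGRICULTURE = "agriculture"
    INDUSTRY = "industry"
    TECHNOLOGY = "technology"

@dataclass
class PolicyAction:
    # Continuous levers, documented range 0.0 - 1.0
    tax_rate: float
    healthcare_budget: float
    education_budget: float
    police_budget: float
    # Categorical / emergency levers
    subsidy: Subsidy = Subsidy.NONE
    lockdown: bool = False
    stimulus: bool = False

    def __post_init__(self):
        # Clamp continuous levers into the documented [0.0, 1.0] range.
        for name in ("tax_rate", "healthcare_budget",
                     "education_budget", "police_budget"):
            setattr(self, name, min(1.0, max(0.0, getattr(self, name))))

a = PolicyAction(tax_rate=1.4, healthcare_budget=0.3,
                 education_budget=0.2, police_budget=0.1)
```

Clamping at construction time keeps out-of-range LLM outputs from ever reaching the simulation engine.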
### ⚖️ Reward Logic (Dense & Hard-to-Game)

We abandoned naive 0/1 binary rewards in favor of a **dense, continuous, anti-exploitation OpenEnv rubric system**. The reward function is explicitly designed to prevent "gaming" the metrics:

1. **Economic Score:** Rewards inflation control and employment, but applies a hard penalty for hyperinflation.
2. **Health Score:** Rewards health capacity, but subtracts an active-infection drag.
3. **Satisfaction Score:** Rewards raw public approval, but caps it if wealth inequality (Gini) is too high.
4. **Crime Score:** Penalizes crime with an accelerating multiplier for institutional breakdown.
5. **Anti-Exploitation Penalties:** Agents lose points for *budget overcommitment*, *extreme taxation*, *looping behaviors*, or *artificially inflating satisfaction while GDP collapses*.
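A toy version of this rubric makes the anti-gaming structure concrete. All weights, thresholds, and metric names below are illustrative assumptions, not the real coefficients:

```python
def rubric_reward(m: dict) -> float:
    """Toy sketch of the dense rubric above; weights are illustrative."""
    # 1. Economic score: inflation control * employment, hyperinflation zeroes it
    econ = max(0.0, 1.0 - 5.0 * abs(m["inflation"])) * m["employment"]
    if m["inflation"] > 0.5:
        econ = 0.0
    # 2. Health score: capacity minus active-infection drag
    health = max(0.0, m["health_index"] - m["infection_rate"])
    # 3. Satisfaction score: capped when inequality (Gini) is extreme
    satisfaction = m["satisfaction"]
    if m["gini"] > 0.6:
        satisfaction = min(satisfaction, 0.5)
    # 4. Crime score: accelerating multiplier past a breakdown threshold
    mult = 2.0 if m["crime_rate"] > 0.2 else 1.0
    crime = max(0.0, 1.0 - m["crime_rate"] * mult)
    score = 0.3 * econ + 0.25 * health + 0.25 * satisfaction + 0.2 * crime
    # 5. Anti-exploitation penalty: extreme taxation
    if m["tax_rate"] > 0.9:
        score -= 0.2
    return min(1.0, max(0.0, score))

r = rubric_reward({"inflation": 0.04, "employment": 0.9, "health_index": 0.8,
                   "infection_rate": 0.05, "satisfaction": 0.7, "gini": 0.4,
                   "crime_rate": 0.1, "tax_rate": 0.35})
```

Because every sub-score is continuous and clipped, an agent cannot max out one metric to hide a collapse in another.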
---

## 📋 Tasks & Grader Logic

CivicAI features three difficulty-tiered tasks with distinct initial conditions and deterministic grading logic:

**🟢 Easy: Economic Stability (`stabilize_economy`)**

* **Scenario:** A mild recession is underway.
* **Success Criteria:** Inflation < 6%, employment > 85%, GDP maintained without deficit spending.
* **Grader Score:** Continuous reward based on deviation from targets.

**🟡 Medium: Pandemic Management (`manage_pandemic`)**

* **Scenario:** A severe virus is sweeping the nation with a 20% infection rate.
* **Success Criteria:** Infection rate < 10%, GDP > $300B.
* **Grader Score:** Trade-off scoring that balances health capacity against the economic damage of lockdowns.

**🔴 Hard: Social Crisis (`control_crisis`)**

* **Scenario:** A compound multi-domain crisis: high unemployment (32%), high crime (25%), and deep wealth inequality.
* **Success Criteria:** Crime < 12%, inequality reduced, employment > 80%.
* **Grader Penalty:** A cascade failure is triggered if social unrest breaches its threshold.
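The thresholds above can be collected into task definitions of the kind `openenv.yaml` describes. This is a sketch: the key names and the simple pass/fail checker are illustrative, and the real graders score continuously rather than binarily:

```python
# Illustrative task table mirroring the tiers above. Criteria that have no
# numeric threshold in the text (e.g. "inequality reduced") are omitted.
TASKS = {
    "stabilize_economy": {
        "difficulty": "easy",
        "success": {"inflation_max": 0.06, "employment_min": 0.85},
    },
    "manage_pandemic": {
        "difficulty": "medium",
        "initial": {"infection_rate": 0.20},
        "success": {"infection_rate_max": 0.10, "gdp_min": 300e9},
    },
    "control_crisis": {
        "difficulty": "hard",
        "initial": {"unemployment": 0.32, "crime_rate": 0.25},
        "success": {"crime_rate_max": 0.12, "employment_min": 0.80},
    },
}

def passes(task: str, metrics: dict) -> bool:
    """Deterministic check of end-of-episode metrics against a task's criteria."""
    ok = True
    for key, bound in TASKS[task]["success"].items():
        name, kind = key.rsplit("_", 1)  # "inflation_max" -> ("inflation", "max")
        value = metrics[name]
        ok &= value <= bound if kind == "max" else value >= bound
    return ok
```

Keeping criteria as data rather than code is what lets `openenv.yaml` declare easy → hard tiers without touching the engine.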
---

## 📈 Training Results (Quantitative)

We trained a GPT-2 policy agent with Hugging Face TRL (Proximal Policy Optimization, PPO) directly in the CivicAI environment.

**Key Results (Economic Stability Task):**

* **Baseline reward:** `0.42`
* **Trained agent reward:** `0.68`
* **Improvement:** `+0.26` (≈ `+62%` relative)

👉 **This demonstrates measurable learning, not random behavior.**
### Reward Curve

*Training reward curve: the PPO agent learns to outperform the random baseline, finding stable fiscal policies that maximize the multi-objective reward.*
### Baseline vs. Trained Comparison

*Baseline-vs-trained comparison chart: the trained agent demonstrates significant improvement across all difficulty tiers, particularly in the macroeconomic stabilization task.*
---

## 🧪 Reproducibility

**You can reproduce the results in under 5 minutes:**

1. Open the [Colab notebook](https://colab.research.google.com/drive/1VhW1LdFTEuQ9i9h65EDxl5_H3qmD1H0v?usp=sharing)
2. Enable a GPU runtime
3. Run all cells
4. Observe the reward improvement

* The training script uses standard TRL PPO.
* The environment is not static: the agent interacts with it live.
* Plots are generated and saved automatically to `/assets`.
---

## 📖 Complete Guide: How It Works (Step-by-Step)

1. **Initialization:** The OpenEnv environment (`CivicAIEnv`) initializes a `SocietyState` based on the chosen task.
2. **Observation:** The agent receives the current state of the nation. In the dashboard you see this visually; in training, the LLM receives it as a text prompt.
3. **Action / Debate:**
   - *In training:* The LLM policy outputs a JSON action.
   - *In the dashboard:* A multi-agent orchestrator facilitates a debate among specialized agents (Economic, Health, Citizen, Ethics) before proposing a consensus action.
4. **Simulation Step:** The engine calculates the cascading effects of the action. For example, high taxes increase revenue but lower GDP growth; high healthcare spending raises the health index and lowers infection rates but drains the budget.
5. **Emergent Dynamics:** The `EmergentTracker` calculates second-order effects: high unemployment leads to crime; sustained wealth inequality leads to social unrest.
6. **Reward Calculation:** The dense rubric evaluates the new state and returns a reward in `[0.0, 1.0]`, alongside explicit penalties for bad governance.
7. **Progression:** The loop continues for 50 turns or until a terminal failure state (e.g., mass unemployment, societal collapse) is reached.
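The seven steps above collapse into a single control loop. A toy sketch, in which the policy, dynamics, and constants are all illustrative stand-ins for the real components:

```python
import random

def random_agent(obs: dict) -> dict:
    # Step 3 stand-in: in training, the LLM policy would emit this JSON action.
    return {"tax_rate": random.uniform(0.1, 0.5),
            "healthcare_budget": random.uniform(0.0, 0.5)}

def toy_step(state: dict, action: dict):
    # Steps 4-6 collapsed: first-order effects, emergent drift, dense reward.
    state["gdp"] *= 1.0 - 0.02 * action["tax_rate"]  # fiscal drag
    state["infection"] = max(0.0, state["infection"]
                             - 0.05 * action["healthcare_budget"])
    # Emergent dynamics: unemployment feeds crime.
    state["crime"] = min(1.0, state["crime"] + 0.01 * (1.0 - state["employment"]))
    reward = max(0.0, 1.0 - state["infection"] - state["crime"])
    done = state["gdp"] < 100.0 or state["crime"] > 0.9  # terminal failure
    return state, reward, done

random.seed(0)
state = {"gdp": 400.0, "infection": 0.2, "crime": 0.1, "employment": 0.85}
total = 0.0
for turn in range(50):               # step 7: 50-turn horizon
    action = random_agent(state)     # step 3: policy proposes an action
    state, reward, done = toy_step(state, action)
    total += reward
    if done:                         # early termination on societal collapse
        break
```

The real engine replaces `toy_step` with the full `SocietyState` update plus the `EmergentTracker` and rubric described above.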
---

## 🎭 Storytelling: What the Agent Learned

Initially, the agent exploited short-term gains: cutting taxes and overspending to inflate satisfaction.

This strategy collapsed under delayed consequences: GDP contraction, rising crime, and systemic instability.

Through PPO training, the agent learned policy discipline:

* Maintain sustainable taxation
* Allocate budgets efficiently
* Avoid extreme oscillations

👉 **The agent did not just optimize rewards; it learned stable governance strategies under uncertainty.**

---

## 🌍 Why This Matters

CivicAI demonstrates that:

* **AI can learn policy trade-offs**, not just predictions.
* **Reward design can enforce ethical and stable behavior.**
* **Simulation environments can act as safe testing grounds** for governance.

👉 **This opens pathways for:**

* Policy simulation tools
* Economic modeling
* Crisis response planning

---
## 🔗 Links & Resources

- 🚀 **Demo (Hugging Face Space):** [https://huggingface.co/spaces/mahammadaftab/CivicAI/](https://huggingface.co/spaces/mahammadaftab/CivicAI/)
- 📓 **Training Notebook (Colab):** [https://colab.research.google.com/drive/1VhW1LdFTEuQ9i9h65EDxl5_H3qmD1H0v?usp=sharing](https://colab.research.google.com/drive/1VhW1LdFTEuQ9i9h65EDxl5_H3qmD1H0v?usp=sharing)
- 📝 **Write-up / Hugging Face Blog:** [Read the blog post](https://huggingface.co/spaces/mahammadaftab/CivicAI/blob/main/BLOG.md)