---
title: CivicAI Society Simulator
emoji: 🏛️
colorFrom: green
colorTo: blue
sdk: docker
app_port: 7860
app_file: server/app.py
pinned: false
---

# 🏛️ CivicAI: AI-Driven Societal Policy Optimization Under Uncertainty

[![OpenEnv](https://img.shields.io/badge/OpenEnv-Compatible-06b6d4?style=for-the-badge)](https://github.com/meta-pytorch/OpenEnv)
[![Python](https://img.shields.io/badge/Python-3.10+-3776AB?style=for-the-badge)](https://python.org)
[![License](https://img.shields.io/badge/License-MIT-green?style=for-the-badge)](LICENSE)

> **Governing a society of 10 million people is not a game of chess. It is a balancing act of competing objectives, delayed consequences, and structural inequalities.**

CivicAI is a production-grade, multi-agent societal decision-making environment designed for the **OpenEnv Hackathon**. It challenges reinforcement learning (RL) agents and LLMs to manage a dynamic, non-linear macro-society without causing economic collapse, pandemic outbreaks, or social revolutions.

---

## 🎯 The Problem

**What real-world problem do we solve?**

Modern governments face a combinatorial decision-making problem. Thousands of interdependent policy levers (taxes, healthcare spending, education, policing, subsidies) interact through complex causal chains to produce emergent societal outcomes, often with weeks-to-years of lag and high uncertainty.

Current AI agents excel at static datasets, text completion, or simple video games. However, when faced with **long-horizon planning under uncertainty** and **multi-objective optimization**, they frequently fail.

CivicAI bridges this capability gap. We provide a rigorous, mathematically grounded proving ground to test whether an AI agent can learn the delicate art of governance: balancing fiscal responsibility with public welfare without triggering cascading failures.

### 🚀 Why This Environment Is Novel

CivicAI is not a grid-world or static-dataset problem.
It introduces:

* **Long-horizon decision making** (50 steps)
* **Delayed consequences** (policy effects unfold over time)
* **Multi-objective optimization** (economy + health + society)
* **Emergent behavior** (crime, inequality, unrest)

👉 **This makes it suitable for training real-world decision-making agents, not toy environments.**

---

## ⚙️ OpenEnv Compliance (MANDATORY API)

CivicAI fully follows the OpenEnv specification:

* `reset()` → initializes the environment with task-specific conditions
* `step(action)` → returns `(observation, reward, done, info)`
* `state()` → returns the full internal state

**Typed Models (Pydantic):**

* `Observation`: structured societal metrics
* `Action`: policy vector (tax, budgets, subsidies)
* `Reward`: normalized score in `[0.0, 1.0]`

**`openenv.yaml` includes:**

* Environment metadata
* Action/Observation schema
* Task definitions (easy → hard)

---

## 🌍 The Environment

The agent acts as the central policy-maker for a society over a 50-turn episode (1 turn = 1 quarter).

### 🔍 Observation Space (12+ Indicators)

Agents observe a dense, continuous state space mapped to real-world equivalents:

- **Macroeconomics:** GDP ($), GDP growth (%), inflation rate (%), employment rate (%).
- **Public Health & Resources:** health index (0-1), infection rate (%), medical/food/energy supplies.
- **Social Cohesion:** public satisfaction (0-1), crime rate (%), wealth inequality (Gini coefficient), social unrest.

### ⚙️ Action Space (Continuous & Categorical)

Agents control federal budgets and policy levers at every turn:

- **Tax Rate** (`0.0 - 1.0`): raises revenue but creates economic drag.
- **Budget Allocations** (`0.0 - 1.0`): healthcare, education, and police budgets.
- **Subsidy Policy:** `none`, `agriculture`, `industry`, or `technology`.
- **Emergency Response:** lockdowns or stimulus packages.
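To make the `reset()` / `step(action)` / `state()` contract concrete, here is a minimal toy stand-in that follows the same interface. Every class, field, and coefficient below (`ToyCivicEnv`, the simplified `Action`/`Observation`, the toy dynamics) is an illustrative assumption, not the actual CivicAI implementation, which uses richer Pydantic models and a full simulation engine.

```python
from dataclasses import dataclass

@dataclass
class Action:
    """Illustrative policy vector; the real CivicAI Action is a Pydantic model."""
    tax_rate: float = 0.30           # 0.0 - 1.0
    healthcare_budget: float = 0.30  # 0.0 - 1.0
    subsidy: str = "none"            # none | agriculture | industry | technology

@dataclass
class Observation:
    """Illustrative subset of the 12+ societal indicators."""
    gdp: float
    inflation: float
    employment: float

class ToyCivicEnv:
    """Minimal stand-in following the OpenEnv reset()/step()/state() contract."""

    def reset(self) -> Observation:
        self.turn = 0
        self.obs = Observation(gdp=400.0, inflation=0.06, employment=0.86)
        return self.obs

    def step(self, action: Action):
        # Clamp the continuous lever into its legal [0, 1] range.
        tax = min(1.0, max(0.0, action.tax_rate))
        self.turn += 1
        # Toy dynamics: taxation cools inflation but drags employment.
        self.obs.inflation = max(0.0, self.obs.inflation - 0.01 * tax)
        self.obs.employment = max(0.0, min(1.0, self.obs.employment - 0.02 * (tax - 0.3)))
        reward = max(0.0, min(1.0, self.obs.employment - self.obs.inflation))
        done = self.turn >= 50  # 50-turn episode
        return self.obs, reward, done, {"turn": self.turn}

    def state(self) -> dict:
        return {"turn": self.turn, "observation": self.obs}
```

A full episode is then the usual loop: call `env.reset()` once, then `env.step(action)` repeatedly until `done` is true.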
### โš–๏ธ Reward Logic (Dense & Hard-to-Game) We abandoned naive 0/1 binary rewards for a **highly continuous, anti-exploitation OpenEnv Rubric System**. The reward function is explicitly designed to prevent "gaming" the metrics: 1. **Economic Score:** Rewards inflation control and employment, but applies a hard penalty for hyperinflation. 2. **Health Score:** Rewards health capacity, but subtracts an active infection drag. 3. **Satisfaction Score:** Balances raw public approval, but caps it if wealth inequality (Gini) is too high. 4. **Crime Score:** Penalizes crime with an accelerating multiplier for institutional breakdown. 5. **Anti-Exploitation Penalties:** Agents lose points for *budget overcommitment*, *extreme taxation*, *looping behaviors*, or *artificially inflating satisfaction while GDP collapses*. --- ## ๐Ÿ“‹ Tasks & Grader Logic CivicAI features three difficulty-tiered tasks with distinct initial conditions and deterministic grading logic: **๐ŸŸข Easy: Economic Stability (`stabilize_economy`)** * **Scenario:** A mild recession is underway. * **Success Criteria:** Inflation < 6%, Employment > 85%, maintain GDP without deficit spending. * **Grader Score:** Continuous reward based on deviation from targets. **๐ŸŸก Medium: Pandemic Management (`manage_pandemic`)** * **Scenario:** A severe virus is sweeping the nation with a 20% infection rate. * **Success Criteria:** Infection rate < 10%, GDP > $300B. * **Grader Score:** Tradeoff scoringโ€”balances health capacity vs economic damage from lockdowns. **๐Ÿ”ด Hard: Social Crisis (`control_crisis`)** * **Scenario:** Compound multi-domain crisisโ€”high unemployment (32%), high crime (25%), and deep wealth inequality. * **Success Criteria:** Crime < 12%, Inequality reduced, Employment > 80%. * **Grader Penalty:** Cascade failure triggered if social unrest breaches threshold. 
---

## 📈 Training Results (Quantitative)

We trained a GPT-2 policy agent with Hugging Face TRL (Proximal Policy Optimization, PPO) directly in the CivicAI environment.

**Key Results (Economic Stability Task):**

* **Baseline reward:** `0.42`
* **Trained agent reward:** `0.68`
* **Improvement:** `+0.26` (`+61%`)

👉 **This demonstrates measurable learning, not random behavior.**

### Reward Curve

![Training reward curve](https://cdn-uploads.huggingface.co/production/uploads/68e1066110db6d257dfceb12/J_jMixXqJNBc7AEYp4hxr.png)

*The PPO agent learns to outperform the random baseline, finding stable fiscal policies that maximize the multi-objective reward.*

### Baseline vs. Trained Comparison

![Baseline vs. trained comparison chart](https://cdn-uploads.huggingface.co/production/uploads/68e1066110db6d257dfceb12/tNnWRZDymTsXVTPfVbtAt.png)

*The trained agent shows significant improvement across all difficulty tiers, particularly in the macroeconomic stabilization task.*

---

## 🧪 Reproducibility

**You can reproduce the results in under 5 minutes:**

1. Open the [Colab notebook](https://colab.research.google.com/drive/1VhW1LdFTEuQ9i9h65EDxl5_H3qmD1H0v?usp=sharing)
2. Enable GPU
3. Run all cells
4. Observe the reward improvement

* The training script uses standard TRL PPO.
* The environment is not static: the agent interacts with it live.
* Plots are generated and saved automatically to `/assets`.

---

## 📖 Complete Guide: How It Works (Step-by-Step)

1. **Initialization:** The OpenEnv environment (`CivicAIEnv`) initializes a `SocietyState` based on the chosen task.
2. **Observation:** The agent receives the current state of the nation. In the dashboard, you see this visually; in training, the LLM receives it as a text prompt.
3. **Action / Debate:**
   - *In Training:* The LLM policy outputs a JSON action.
   - *In Dashboard:* A multi-agent orchestrator facilitates a debate among specialized agents (Economic, Health, Citizen, Ethics) before proposing a consensus action.
4. **Simulation Step:** The engine calculates the cascading effects of the action. For example, high taxes increase revenue but lower GDP growth; high healthcare spending raises the health index and lowers infection rates but drains the budget.
5. **Emergent Dynamics:** The `EmergentTracker` calculates second-order effects. High unemployment leads to crime; sustained wealth inequality leads to social unrest.
6. **Reward Calculation:** The dense rubric evaluates the new state and returns a reward in `[0.0, 1.0]`, alongside explicit penalties for bad governance.
7. **Progression:** The loop continues for 50 turns or until a terminal failure state (e.g., mass unemployment, societal collapse) is reached.

---

## 🎭 Storytelling: What the Agent Learned

Initially, the agent exploited short-term gains, cutting taxes and overspending to inflate satisfaction. This strategy collapsed under delayed consequences: GDP contraction, rising crime, and systemic instability.

Through PPO training, the agent learned policy discipline:

* Maintain sustainable taxation
* Allocate budgets efficiently
* Avoid extreme oscillations

👉 **The agent did not just optimize rewards; it learned stable governance strategies under uncertainty.**

---

## 🌍 Why This Matters

CivicAI demonstrates that:

* **AI can learn policy trade-offs**, not just predictions.
* **Reward design can enforce ethical and stable behavior.**
* **Simulation environments can act as safe testing grounds** for governance.
👉 **This opens pathways for:**

* Policy simulation tools
* Economic modeling
* Crisis response planning

---

## 🔗 Links & Resources

- 🚀 **Demo (Hugging Face Space):** [https://huggingface.co/spaces/mahammadaftab/CivicAI/](https://huggingface.co/spaces/mahammadaftab/CivicAI/)
- 📓 **Training Notebook (Colab):** [https://colab.research.google.com/drive/1VhW1LdFTEuQ9i9h65EDxl5_H3qmD1H0v?usp=sharing](https://colab.research.google.com/drive/1VhW1LdFTEuQ9i9h65EDxl5_H3qmD1H0v?usp=sharing)
- 📝 **Write-up (Hugging Face Blog):** [Read the HF blog post](https://huggingface.co/spaces/mahammadaftab/CivicAI/blob/main/BLOG.md)