# TIL-26-AE: Automated Exploration Bomberman Agent

**Repository**: `E-Rong/til-26-ae-agent`
**Challenge**: The Intelligent League (TIL) — Automated Exploration (AE)
**Base Environment**: `e-rong/til-26-ae` Space
**Model Repo**: `E-Rong/til-26-ae-agent` (checkpoints + inference code)

---

## Table of Contents

1. [Research & Literature Review](#1-research--literature-review)
2. [Problem Analysis](#2-problem-analysis)
3. [Development Decisions](#3-development-decisions)
4. [Training Phases](#4-training-phases)
5. [Results](#5-results)
6. [Artifacts](#6-artifacts)
7. [Next Steps](#7-next-steps)

---

## 1. Research & Literature Review

### 1.1 Domain: Multi-Agent Bomberman RL

The TIL-26-AE challenge is set in a multi-agent Bomberman-like environment: agents navigate a grid, collect resources, place bombs, destroy walls, and eliminate opponents. The central difficulty is **autonomous exploration**.

### 1.2 Key Papers

| | Paper | arXiv ID | Key Insight | Relevance | |
| |---|---|---|---| |
| | *Pommerman: A Multi-Agent Benchmark* | 2407.00662 | PettingZoo + parallel env standard | Confirmed approach | |
| | *MAPPO* | 2103.01955 | Shared parameters, curriculum | Justified curriculum | |
| | *Invalid Action Masking* | 2006.14171 | Masks logits before softmax | **Directly applicable** | |
| | *PPO Algorithms* | 1707.06347 | Clipped surrogate, stable | Chosen over DQN | |

### 1.3 Why MaskablePPO?

Bomberman agents cannot move into walls or out of bounds, and cannot place bombs when none are in stock. The observation therefore includes `action_mask: uint8[6]`. Standard PPO would waste ~30-40% of its samples on illegal moves; MaskablePPO masks the logits before the softmax, so only legal actions are ever sampled.

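To make the mechanism concrete, here is a minimal sketch of masking logits before the softmax. It assumes PyTorch tensors; the function name and example mask are illustrative, and this is not the `sb3-contrib` implementation.

```python
# Minimal logit-masking sketch (illustrative, not sb3-contrib internals).
import torch

def masked_action_distribution(logits: torch.Tensor, action_mask: torch.Tensor):
    """Set logits of illegal actions to -inf so softmax gives them zero probability."""
    mask = action_mask.bool()
    masked_logits = torch.where(mask, logits, torch.full_like(logits, float("-inf")))
    return torch.distributions.Categorical(logits=masked_logits)

logits = torch.randn(6)                                             # raw policy logits, Discrete(6)
action_mask = torch.tensor([1, 1, 0, 1, 0, 1], dtype=torch.uint8)
action = masked_action_distribution(logits, action_mask).sample()   # never an illegal action
```
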
### 1.4 Why Curriculum Learning?

Training against strong opponents from scratch leads to catastrophic early losses (~0 reward). Curriculum learning (easy → hard) is standard in competitive multi-agent RL.

### 1.5 Why Not DQN?

DQN struggles with action masking (it requires custom architecture changes). PPO's on-policy updates handle the non-stationarity of multi-agent self-play better, and mature masking support already exists in `sb3-contrib`.

---

## 2. Problem Analysis

### 2.1 Environment Structure

- **Grid size**: 16×16
- **Agents**: Configurable (default 2 teams, Phase 3 uses 3)
- **Observations**: Dict with `agent_viewcone[7×5×25]`, `base_viewcone[5×5×25]`, direction, location, health, `action_mask[6]`, etc. (see the space sketch after this list)
- **Actions**: Discrete(6) — FORWARD, BACKWARD, LEFT, RIGHT, STAY, PLACE_BOMB
- **Episode length**: ~200 steps

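For illustration, the spaces above can be written out as Gymnasium spaces. This is a sketch based only on the shapes listed in the bullets; the dtypes, value bounds, and exact field set are assumptions, not the real `til-26-ae` definitions.

```python
# Illustrative Gymnasium spaces matching the shapes above (bounds/dtypes assumed).
import numpy as np
from gymnasium import spaces

observation_space = spaces.Dict({
    "agent_viewcone": spaces.Box(0, 255, shape=(7, 5, 25), dtype=np.uint8),
    "base_viewcone": spaces.Box(0, 255, shape=(5, 5, 25), dtype=np.uint8),
    "direction": spaces.Discrete(4),                                # assumed 4 headings
    "location": spaces.Box(0, 15, shape=(2,), dtype=np.int64),      # 16×16 grid coords
    "health": spaces.Box(0, 100, shape=(1,), dtype=np.float32),
    "action_mask": spaces.Box(0, 1, shape=(6,), dtype=np.uint8),
})

# FORWARD, BACKWARD, LEFT, RIGHT, STAY, PLACE_BOMB
action_space = spaces.Discrete(6)
```
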
### 2.2 Observation Flattening

The Dict observation is flattened to a **1511-dim vector**: agent_viewcone (875 = 7×5×25) + base_viewcone (625 = 5×5×25) + 11 scalar features.

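A minimal sketch of that flattening step, assuming NumPy arrays with the shapes from §2.1; the helper name and the single `"scalars"` key standing in for the 11 scalar features are illustrative.

```python
# Illustrative flattening of the Dict observation into a 1511-dim vector.
import numpy as np

def flatten_observation(obs: dict) -> np.ndarray:
    parts = [
        np.asarray(obs["agent_viewcone"], dtype=np.float32).ravel(),  # 7*5*25 = 875
        np.asarray(obs["base_viewcone"], dtype=np.float32).ravel(),   # 5*5*25 = 625
        np.asarray(obs["scalars"], dtype=np.float32).ravel(),         # 11 scalar features
    ]
    flat = np.concatenate(parts)
    assert flat.shape == (1511,), flat.shape
    return flat
```
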
### 2.3 Action Masking

Critical bug found: `ActionMasker` must be the outermost wrapper, i.e. it wraps *outside* `Monitor`, not inside it. Otherwise `get_action_masks()` fails, because `Monitor` itself does not expose `action_masks()`.

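A sketch of that ordering with the standard SB3 `Monitor` and `sb3-contrib` `ActionMasker`; `mask_fn` and the `last_action_mask` attribute are hypothetical placeholders for the project's own mask extraction.

```python
# Wrapper-ordering sketch: ActionMasker applied last (outermost) so that
# get_action_masks() can find action_masks().
from sb3_contrib.common.maskable.utils import get_action_masks
from sb3_contrib.common.wrappers import ActionMasker
from stable_baselines3.common.monitor import Monitor

def mask_fn(env):
    # Hypothetical: read the uint8[6] mask stored by the base environment.
    return env.unwrapped.last_action_mask

def wrap_env(base_env):
    env = Monitor(base_env)            # episode statistics (inner wrapper)
    return ActionMasker(env, mask_fn)  # exposes action_masks() (outermost wrapper)

# At rollout/eval time: masks = get_action_masks(wrapped_env)
```
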
---

## 3. Development Decisions

### 3.1 Single-Agent Wrapper

The wrapper controls only `agent_0`; the other agents follow random (Phases 1-2) or rule-based (Phase 3) policies. This reduces the task to single-agent RL in a non-stationary environment.

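A condensed sketch of that reduction, assuming a PettingZoo-style parallel environment underneath; the class name, `opponent_policy`, and `flatten_fn` are placeholders, not the repository's actual wrapper.

```python
# Single-agent view sketch: SB3 drives agent_0, scripted policies drive the rest.
import gymnasium as gym
import numpy as np

class SingleAgentWrapper(gym.Env):
    def __init__(self, parallel_env, opponent_policy, flatten_fn):
        self.penv = parallel_env
        self.opponent_policy = opponent_policy      # random (Phases 1-2) or rule-based (Phase 3)
        self.flatten_fn = flatten_fn                # Dict obs -> 1511-dim vector
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, (1511,), np.float32)
        self.action_space = gym.spaces.Discrete(6)

    def reset(self, *, seed=None, options=None):
        self.obs, infos = self.penv.reset(seed=seed)
        return self.flatten_fn(self.obs["agent_0"]), infos.get("agent_0", {})

    def step(self, action):
        actions = {"agent_0": int(action)}
        for agent in self.penv.agents:
            if agent != "agent_0":
                actions[agent] = self.opponent_policy(self.obs[agent])
        self.obs, rewards, terms, truncs, infos = self.penv.step(actions)
        return (self.flatten_fn(self.obs["agent_0"]), rewards["agent_0"],
                terms["agent_0"], truncs["agent_0"], infos.get("agent_0", {}))
```
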
### 3.2 3-Phase Curriculum

| Phase | Opponent | Steps | Purpose |
| |---|---|---|---| |
| | **1** | Random | 500k | Learn movement, bombs, basics | |
| | **2** | Random + exploration bonus | 500k | Prevent camping exploit | |
| | **3** | Rule-based curriculum | 1M | Generalize to structured opponents | |
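
Each phase resumes the previous phase's checkpoint rather than training from scratch (Phase 2 continues the step counter from 500,352). A hedged sketch of that hand-off; `make_phase2_env` is a hypothetical factory, while the checkpoint name follows the Artifacts table.

```python
# Curriculum hand-off sketch: resume the Phase 1 weights in the Phase 2 env.
from sb3_contrib import MaskablePPO

env = make_phase2_env()                               # hypothetical: adds the exploration bonus
model = MaskablePPO.load("phase1_final.zip", env=env)
model.learn(total_timesteps=500_000, reset_num_timesteps=False)  # continue the step counter
model.save("phase2_final")
```
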
### 3.3 Philosophy

- `stable-baselines3` for PPO core
- `sb3-contrib` for MaskablePPO + ActionMasker
- `huggingface_hub` for persistent checkpoint storage

### 3.4 Why Push to the Hub Every 50k Steps

Sandbox resets (T4 container recycling) wiped local `/app/data/` multiple times. Hub checkpointing saved the project at the 400k-step mark when training crashed.

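A minimal sketch of such a periodic upload as an SB3 callback. Only `HfApi.upload_file` and the standard callback hooks are taken from their libraries; the class name, local path, and single-env step check are illustrative.

```python
# Illustrative periodic checkpoint upload: save locally, then push to the Hub.
from huggingface_hub import HfApi
from stable_baselines3.common.callbacks import BaseCallback

class HubCheckpointCallback(BaseCallback):
    def __init__(self, repo_id: str, save_freq: int = 50_000, verbose: int = 0):
        super().__init__(verbose)
        self.repo_id = repo_id
        self.save_freq = save_freq
        self.api = HfApi()

    def _on_step(self) -> bool:
        if self.num_timesteps % self.save_freq == 0:
            path = f"/app/data/ckpt_{self.num_timesteps}.zip"
            self.model.save(path)
            self.api.upload_file(path_or_fileobj=path,
                                 path_in_repo=f"ckpt_{self.num_timesteps}.zip",
                                 repo_id=self.repo_id)
        return True  # keep training
```

It would be attached via something like `model.learn(..., callback=HubCheckpointCallback("E-Rong/til-26-ae-agent"))`.
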
---

## 4. Training Phases

### 4.1 Phase 1: Foundation (vs Random)

**Duration**: 500,352 steps
**Result**: Win rate 92%, avg reward 180.1, 100% survival
**Challenges**: Wrapper ordering, dependency issues, sandbox resets

### 4.2 Phase 2: Exploration Shaping (IN PROGRESS)

**Status**: Started at 500,352 steps, running on an A10G at ~54 FPS
**Mechanism**: Visit-count bonus of 1 / (1 + visits), with adaptive annealing via tanh(avg_enemy_deaths) (sketched below)
**ETA**: ~2.5 hours, targeting 1,000,352 total steps
**Purpose**: Force map exploration, prevent safe base-camping

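A small sketch of those shaping terms. The per-cell counting, the annealing direction (the bonus fading as average enemy deaths grow), and all names are assumptions layered on the formulas above; `K_EXPLORE` mirrors the "Adaptive k=1.2" entry in §5.2.

```python
# Sketch of the Phase 2 exploration shaping (assumed structure, not repo code).
import math
from collections import defaultdict

visit_counts = defaultdict(int)   # hypothetical per-cell visit counter
K_EXPLORE = 1.2                   # exploration gain (see the Phase 2 interim table)

def exploration_bonus(cell, avg_enemy_deaths: float) -> float:
    visit_counts[cell] += 1
    novelty = 1.0 / (1.0 + visit_counts[cell])    # decays as a cell is revisited
    anneal = 1.0 - math.tanh(avg_enemy_deaths)    # assumed: fade once the agent scores kills
    return K_EXPLORE * anneal * novelty
```
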
### 4.3 Phase 3: Curriculum Self-Play

**Pending**: Rule-based static → simple → smart → mixed, 3 teams, 1M steps

---

## 5. Results

### 5.1 Phase 1 Results

| | Metric | Value | |
| |---|---| |
| | Timesteps | 500,352 | |
| | Final Reward | 237.0 | |
| | FPS | 52 (A10G) | |
| | Wall time | ~2h 15min | |
| | Win Rate (eval) | **92.0%** | |
| | Avg Reward (eval) | **180.1** | |
| | Survival Rate | **100.0%** | |

### 5.2 Phase 2 Interim (Early)

| | Metric | Value | |
| |---|---| |
| | Starting Step | 500,352 | |
| | Initial Reward (shaped) | 210 | |
| | FPS | 54 | |
| | Explore Weight | Adaptive k=1.2 | |

---

## 6. Artifacts

| | File | Purpose | |
| |---|---| |
| | `phase1_final.zip` | Trained model | |
| | `phase2_final.zip` | *(in progress)* | |
| | `ckpt_50000-400000.zip` | Phase 1 intermediates | |
| | `ae_manager.py` | Inference code | |
| | `docs/ae.md` | This documentation | |

---

## 7. Next Steps

- **Phase 2**: Complete 500k exploration-shaping steps
- **Phase 3**: Curriculum vs rule-based opponents (1M steps)
- **Eval**: Multi-team evaluation vs smart opponents
- **Future**: CNN policy, opponent modeling, LSTM memory

*Last updated: 2026-05-14 — Phase 2 in progress*