# TIL-26-AE: Automated Exploration Bomberman Agent

**Repository**: `E-Rong/til-26-ae-agent`
**Challenge**: The Intelligent League (TIL) – Automated Exploration (AE)
**Base Environment**: `e-rong/til-26-ae` Space
**Model Repo**: `E-Rong/til-26-ae-agent` (checkpoints + inference code)

---

## Table of Contents

1. [Research & Literature Review](#1-research--literature-review)
2. [Problem Analysis](#2-problem-analysis)
3. [Development Decisions](#3-development-decisions)
4. [Training Phases](#4-training-phases)
5. [Results](#5-results)
6. [Artifacts](#6-artifacts)
7. [Next Steps](#7-next-steps)

---

## 1. Research & Literature Review

### 1.1 Domain: Multi-Agent Bomberman RL

The TIL-26-AE challenge is a multi-agent Bomberman-like environment where agents navigate a grid, collect resources, place bombs, destroy walls, and eliminate opponents. The key challenge is **autonomous exploration**.

### 1.2 Key Papers
| | Paper | arXiv ID | Key Insight | Relevance | |
| |---|---|---|---| |
| | *Pommerman: A Multi-Agent Benchmark* | 2407.00662 | PettingZoo + parallel env standard | Confirmed approach | |
| | *MAPPO* | 2103.01955 | Shared parameters, curriculum | Justified curriculum | |
| | *Invalid Action Masking* | 2006.14171 | Masks logits before softmax | **Directly applicable** | |
| | *PPO Algorithms* | 1707.06347 | Clipped surrogate, stable | Chosen over DQN | |

### 1.3 Why MaskablePPO?

Bomberman agents cannot move into walls or out of bounds, and cannot place bombs without one in stock. The observation includes `action_mask: uint8[6]`. Standard PPO would waste roughly 30-40% of samples on illegal moves; MaskablePPO masks the logits before the softmax, ensuring only legal actions are sampled.
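The masking step is simple to state precisely. Below is a minimal NumPy sketch of the idea; sb3-contrib's MaskablePPO performs the equivalent operation on PyTorch logits internally, so this is illustrative, not the library's code:

```python
import numpy as np

def masked_softmax(logits: np.ndarray, action_mask: np.ndarray) -> np.ndarray:
    """Set illegal-action logits to -inf so softmax assigns them zero probability."""
    legal = action_mask.astype(bool)
    masked = np.where(legal, logits, -np.inf)
    # Subtract the max legal logit for numerical stability; exp(-inf) -> 0.
    exp = np.exp(masked - masked[legal].max())
    return exp / exp.sum()

# Toy 6-way Bomberman action head: only FORWARD, BACKWARD, and STAY are legal.
logits = np.array([1.0, 2.0, 0.5, -1.0, 0.0, 3.0])
mask = np.array([1, 1, 0, 0, 1, 0], dtype=np.uint8)
probs = masked_softmax(logits, mask)
```

Note that PLACE_BOMB gets zero probability here even though it has the largest raw logit, which is exactly the sample waste masking eliminates.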

### 1.4 Why Curriculum Learning?

Training against strong opponents from scratch leads to catastrophic early losses (~0 reward). Curriculum learning (easy → hard) is standard practice in competitive multi-agent RL.

### 1.5 Why Not DQN?

DQN struggles with action masking (it requires a custom architecture), whereas PPO's on-policy updates cope better with the non-stationarity of multi-agent self-play and have mature masking support in `sb3-contrib`.

---

## 2. Problem Analysis

### 2.1 Environment Structure

- **Grid size**: 16×16
- **Agents**: Configurable (default 2 teams; Phase 3 uses 3)
- **Observations**: Dict with `agent_viewcone[7×5×25]`, `base_viewcone[5×5×25]`, direction, location, health, `action_mask[6]`, etc.
- **Actions**: Discrete(6): FORWARD, BACKWARD, LEFT, RIGHT, STAY, PLACE_BOMB
- **Episode length**: ~200 steps

### 2.2 Observation Flattening

Flattened to a **1511-dim vector**: agent_viewcone (875) + base_viewcone (625) + 11 scalars.
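The flattening itself can be sketched as below. The dict keys and the packing of the 11 scalars into a single `"scalars"` field are illustrative assumptions about the wrapper; only the sizes (875 + 625 + 11 = 1511) come from the description above:

```python
import numpy as np

def flatten_obs(obs: dict) -> np.ndarray:
    """Flatten the Dict observation into the 1511-dim vector fed to the MLP policy."""
    return np.concatenate([
        np.asarray(obs["agent_viewcone"], dtype=np.float32).ravel(),  # 7*5*25 = 875
        np.asarray(obs["base_viewcone"], dtype=np.float32).ravel(),   # 5*5*25 = 625
        np.asarray(obs["scalars"], dtype=np.float32).ravel(),         # direction, location, health, ... = 11
    ])

# Dummy observation with the shapes described above.
dummy = {
    "agent_viewcone": np.zeros((7, 5, 25)),
    "base_viewcone": np.zeros((5, 5, 25)),
    "scalars": np.zeros(11),
}
flat = flatten_obs(dummy)
```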

### 2.3 Action Masking

Critical bug found: `Monitor` must wrap *outside* `ActionMasker`, not inside. Otherwise `get_action_masks()` fails because `Monitor` does not expose `action_masks()`.

---

## 3. Development Decisions

### 3.1 Single-Agent Wrapper

The wrapper controls only `agent_0`; opponents use random (Phases 1-2) or rule-based (Phase 3) policies. This reduces the problem to single-agent RL in a non-stationary environment.
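A minimal sketch of such a wrapper is below, assuming a PettingZoo Parallel-style base env (per-agent dicts from `reset`/`step`). The class and attribute names are illustrative, not the project's actual wrapper:

```python
import numpy as np

class SingleAgentWrapper:
    """Expose agent_0 as a single-agent env; all other agents act via a fixed opponent policy."""

    def __init__(self, parallel_env, opponent_policy):
        self.env = parallel_env
        self.opponent_policy = opponent_policy
        self._last_obs = None

    def reset(self):
        self._last_obs, infos = self.env.reset()
        return self._last_obs["agent_0"], infos.get("agent_0", {})

    def step(self, action):
        # Learner's action for agent_0; opponent policy fills in everyone else.
        actions = {a: (action if a == "agent_0" else self.opponent_policy(o))
                   for a, o in self._last_obs.items()}
        self._last_obs, rew, term, trunc, infos = self.env.step(actions)
        return (self._last_obs["agent_0"], rew["agent_0"],
                term["agent_0"], trunc["agent_0"], infos["agent_0"])

# Minimal fake 2-agent parallel env to exercise the wrapper.
class FakeParallelEnv:
    agents = ["agent_0", "agent_1"]
    def reset(self):
        return {a: np.zeros(4) for a in self.agents}, {a: {} for a in self.agents}
    def step(self, actions):
        obs = {a: np.ones(4) for a in self.agents}
        rew = {a: 1.0 for a in self.agents}
        flags = {a: False for a in self.agents}
        return obs, rew, dict(flags), dict(flags), {a: {} for a in self.agents}

random_policy = lambda obs: 4  # STAY
env = SingleAgentWrapper(FakeParallelEnv(), random_policy)
obs0, _ = env.reset()
obs1, r, term, trunc, info = env.step(0)
```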

### 3.2 3-Phase Curriculum

| | Phase | Opponent | Duration | Purpose | |
| |---|---|---|---| |
| | **1** | Random | 500k | Learn movement, bombs, basics | |
| | **2** | Random + exploration bonus | 500k | Prevent camping exploit | |
| | **3** | Rule-based curriculum | 1M | Generalize to structured opponents | |

### 3.3 Philosophy

- `stable-baselines3` for the PPO core
- `sb3-contrib` for MaskablePPO + ActionMasker
- `huggingface_hub` for persistent checkpoint storage

### 3.4 Why Checkpoint to the Hub Every 50k Steps

Sandbox resets (T4 container recycling) wiped local `/app/data/` multiple times. Hub checkpointing saved the project at 400k steps when training crashed.

---

## 4. Training Phases

### 4.1 Phase 1: Foundation (vs Random)

**Duration**: 500,352 steps
**Result**: Win rate 92%, avg reward 180.1, 100% survival
**Challenges**: Wrapper ordering, dependency issues, sandbox resets

### 4.2 Phase 2: Exploration Shaping (COMPLETE)

**Duration**: 501,408 additional steps (500,352 → 1,001,760)
**Mechanism**: Visit-count bonus = 1/(1+visits), with adaptive annealing via tanh(avg_enemy_deaths)
**Hardware**: A10G, ~50 FPS
**Wall time**: ~2h 45min
**Result**: Win rate 93.0%, avg reward 153.4, avg bombs 20.1
**Key insight**: Reward decreased (180 → 153) while win rate increased (92% → 93%), confirming that exploration makes the policy more robust at the cost of safe base-camping reward.
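The shaping term can be written out explicitly. Scaling the visit-count bonus by `1 - tanh(avg_enemy_deaths)` is one plausible reading of the "adaptive annealing" described above (the bonus fades as the agent starts winning fights), not the verified implementation:

```python
import numpy as np

def exploration_bonus(visits: int, avg_enemy_deaths: float) -> float:
    """Phase 2 shaping: 1/(1+visits) per cell, annealed away as kills rise."""
    anneal = 1.0 - np.tanh(avg_enemy_deaths)
    return anneal / (1.0 + visits)

first_visit = exploration_bonus(0, 0.0)  # full bonus early in training
late_visit = exploration_bonus(9, 0.0)   # decays with revisits of the same cell
annealed = exploration_bonus(0, 3.0)     # nearly gone once kills are common
```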

### 4.3 Phase 3: Curriculum Self-Play (PENDING)

**Script**: `phase3_curriculum.py` (ready on Hub)
**Plan**: 5-stage rule-based curriculum: static → random → simple_bomb → evasive → mixed
**Duration**: 1M steps
**Advancement gate**: >55% win rate per stage
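The stage-advancement logic amounts to a gated loop like the following sketch. `train_fn` and `eval_win_rate` stand in for the project's training and evaluation routines (assumed interfaces), and the toy stand-ins below exist only to exercise the loop:

```python
STAGES = ["static", "random", "simple_bomb", "evasive", "mixed"]

def run_curriculum(stages, train_fn, eval_win_rate, gate=0.55, max_rounds=50):
    """Train against each opponent stage until eval win rate clears the gate."""
    history = []
    for stage in stages:
        for _ in range(max_rounds):
            train_fn(stage)
            wr = eval_win_rate(stage)
            history.append((stage, wr))
            if wr > gate:
                break  # gate cleared: advance to the next stage
    return history

# Toy stand-ins: win rate improves by 0.2 per training round on each stage.
progress = {}
def train_fn(stage):
    progress[stage] = progress.get(stage, 0.0) + 0.2
def eval_win_rate(stage):
    return progress[stage]

history = run_curriculum(STAGES, train_fn, eval_win_rate)
```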

---

## 5. Results

### 5.1 Phase 1 Results

| | Metric | Value | |
| |---|---| |
| | Timesteps | 500,352 | |
| | Final Reward | 237.0 | |
| | FPS | 52 (A10G) | |
| | Wall time | ~2h 15min | |
| | Win Rate (eval) | **92.0%** | |
| | Avg Reward (eval) | **180.1** | |
| | Survival Rate | **100.0%** | |

### 5.2 Phase 2 Results

| | Metric | Value | |
| |---|---| |
| Timesteps | 1,001,760 total (501,408 new) |
| | FPS | 50 (A10G) | |
| | Wall time | ~2h 45min | |
| | Win Rate (eval) | **93.0%** | |
| | Avg Reward (eval) | **153.4** | |
| | Avg Bombs | **20.1** | |

---

## 6. Artifacts

| | File | Purpose | |
| |---|---| |
| | `phase1_final.zip` | Phase 1 complete checkpoint | |
| | `phase2_final.zip` | Phase 2 complete checkpoint | |
| `phase2_ckpt_*.zip` | Phase 2 intermediates (650k–1M) |
| | `phase2_eval_results.txt` | Phase 2 evaluation metrics | |
| | `ae_manager.py` | Inference code | |
| | `docs/ae.md` | This documentation | |

---

## 7. Next Steps

- [ ] Submit Phase 3 HF Job (`phase3_curriculum.py`)
- [ ] Monitor 5-stage curriculum progression
- [ ] Evaluate final model vs mixed rule-based opponents
- [ ] Future: CNN policy, opponent modeling, LSTM memory

*Last updated: 2026-05-14 – Phase 2 complete, Phase 3 ready*