# TIL-26-AE: Automated Exploration Bomberman Agent

**Repository**: `E-Rong/til-26-ae-agent`
**Challenge**: The Intelligent League (TIL) — Automated Exploration (AE)
**Base Environment**: `e-rong/til-26-ae` Space
**Model Repo**: `E-Rong/til-26-ae-agent` (checkpoints + inference code)

---

## Table of Contents

1. [Research & Literature Review](#1-research--literature-review)
2. [Problem Analysis](#2-problem-analysis)
3. [Development Decisions](#3-development-decisions)
4. [Training Phases](#4-training-phases)
5. [Results](#5-results)
6. [Artifacts](#6-artifacts)
7. [Next Steps](#7-next-steps)

---

## 1. Research & Literature Review

### 1.1 Domain: Multi-Agent Bomberman RL

The TIL-26-AE challenge is a multi-agent Bomberman-like environment in which agents navigate a grid, collect resources, place bombs, destroy walls, and eliminate opponents. The key challenge is **autonomous exploration**.

### 1.2 Key Papers

| Paper | arXiv ID | Key Insight | Relevance |
|---|---|---|---|
| *Pommerman: A Multi-Agent Benchmark* | 2407.00662 | PettingZoo + parallel env standard | Confirmed approach |
| *MAPPO* | 2103.01955 | Shared parameters, curriculum | Justified curriculum |
| *Invalid Action Masking* | 2006.14171 | Masks logits before softmax | **Directly applicable** |
| *PPO Algorithms* | 1707.06347 | Clipped surrogate, stable | Chosen over DQN |

### 1.3 Why MaskablePPO?

Bomberman agents cannot move into walls, step out of bounds, or place bombs without a stockpile. The observation includes `action_mask: uint8[6]`. Standard PPO would waste roughly 30–40% of samples on illegal moves; MaskablePPO masks the logits before the softmax, ensuring only legal actions are sampled.

### 1.4 Why Curriculum Learning?

Training against strong opponents from scratch leads to catastrophic early losses (~0 reward). Curriculum learning (easy → hard) is standard practice in competitive multi-agent RL.

### 1.5 Why Not DQN?

DQN struggles with action masking (it requires a custom architecture).
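The mask-before-softmax mechanism that makes MaskablePPO "directly applicable" here (§1.3) is simple to illustrate. Below is a minimal numpy sketch of the idea — not the actual `sb3-contrib` implementation, which applies the same trick to the policy's logits tensor:

```python
import numpy as np

def masked_softmax(logits: np.ndarray, action_mask: np.ndarray) -> np.ndarray:
    """Push illegal-action logits to -inf before softmax, so they get
    exactly zero probability and legal actions renormalize."""
    masked = np.where(action_mask.astype(bool), logits, -np.inf)
    # Subtract the max legal logit for numerical stability; exp(-inf) = 0.
    exp = np.exp(masked - masked[np.isfinite(masked)].max())
    return exp / exp.sum()

# 6 actions, matching action_mask: uint8[6]. Suppose BACKWARD (index 1)
# and PLACE_BOMB (index 5) are currently illegal:
logits = np.array([1.0, 3.0, 0.5, 0.2, -1.0, 4.0])
mask = np.array([1, 0, 1, 1, 1, 0], dtype=np.uint8)
probs = masked_softmax(logits, mask)
```

Even though indices 1 and 5 have the highest raw logits, they receive zero probability; sampling can never pick an illegal move, which is exactly the sample-efficiency argument above.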
PPO's on-policy updates handle the non-stationarity of multi-agent self-play better, and PPO has mature masking support in `sb3-contrib`.

---

## 2. Problem Analysis

### 2.1 Environment Structure

- **Grid size**: 16×16
- **Agents**: Configurable (default 2 teams; Phase 3 uses 3)
- **Observations**: Dict with `agent_viewcone[7×5×25]`, `base_viewcone[5×5×25]`, direction, location, health, `action_mask[6]`, etc.
- **Actions**: Discrete(6) — FORWARD, BACKWARD, LEFT, RIGHT, STAY, PLACE_BOMB
- **Episode length**: ~200 steps

### 2.2 Observation Flattening

The Dict observation is flattened to a **1511-dim vector**: agent_viewcone (875) + base_viewcone (625) + 11 scalars.

### 2.3 Action Masking

Critical bug found: `Monitor` must wrap *outside* `ActionMasker`, not inside. Otherwise `get_action_masks()` fails because `Monitor` does not expose `action_masks()`.

---

## 3. Development Decisions

### 3.1 Single-Agent Wrapper

Controls only `agent_0`; opponents use random (Phases 1–2) or rule-based (Phase 3) policies. This reduces the problem to single-agent RL in a non-stationary environment.

### 3.2 3-Phase Curriculum

| Phase | Opponent | Duration | Purpose |
|---|---|---|---|
| **1** | Random | 500k | Learn movement, bombs, basics |
| **2** | Random + exploration bonus | 500k | Prevent camping exploit |
| **3** | Rule-based curriculum | 1M | Generalize to structured opponents |

### 3.3 Philosophy

- `stable-baselines3` for the PPO core
- `sb3-contrib` for MaskablePPO + ActionMasker
- `huggingface_hub` for persistent checkpoint storage

### 3.4 Why Push to the Hub Every 50k Steps?

Sandbox resets (T4 container recycling) caused loss of local `/app/data/` multiple times. Hub checkpointing saved the project at 400k steps when training crashed.

---
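The flattening in §2.2 can be sketched in a few lines of numpy. The viewcone shapes below come from the environment spec; the `"scalars"` key is a hypothetical aggregation of the 11 scalar fields (direction, location, health, etc.) whose exact layout the doc does not spell out:

```python
import numpy as np

def flatten_obs(obs: dict) -> np.ndarray:
    """Flatten the Dict observation into the 1511-dim policy input:
    875 (agent viewcone) + 625 (base viewcone) + 11 scalars."""
    parts = [
        np.asarray(obs["agent_viewcone"], dtype=np.float32).ravel(),  # 7*5*25 = 875
        np.asarray(obs["base_viewcone"], dtype=np.float32).ravel(),   # 5*5*25 = 625
        np.asarray(obs["scalars"], dtype=np.float32).ravel(),         # 11 (assumed layout)
    ]
    return np.concatenate(parts)

# Dummy observation with the documented shapes:
obs = {
    "agent_viewcone": np.zeros((7, 5, 25)),
    "base_viewcone": np.zeros((5, 5, 25)),
    "scalars": np.zeros(11),
}
flat = flatten_obs(obs)
```

Note that 875 + 625 + 11 = 1511, matching the vector size quoted above; keeping the concatenation order fixed is what makes the checkpoint reusable at inference time.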
## 4. Training Phases

### 4.1 Phase 1: Foundation (vs Random)

**Duration**: 500,352 steps
**Result**: 92% win rate, 180.1 avg reward, 100% survival
**Challenges**: Wrapper ordering, dependency issues, sandbox resets

### 4.2 Phase 2: Exploration Shaping (COMPLETE)

**Duration**: 500,408 additional steps (600,352 → 1,001,760)
**Mechanism**: Visit-count bonus = 1/(1+visits), with adaptive annealing via tanh(avg_enemy_deaths)
**Hardware**: A10G, ~50 FPS
**Wall time**: ~2h 45min
**Result**: 93.0% win rate, 153.4 avg reward, 20.1 avg bombs
**Key insight**: Average reward decreased (180 → 153) while win rate increased (92% → 93%), confirming that exploration makes the policy more robust at the cost of safe base-camping reward.

### 4.3 Phase 3: Curriculum Self-Play (PENDING)

**Script**: `phase3_curriculum.py` (ready on Hub)
**Plan**: 5-stage rule-based curriculum — static → random → simple_bomb → evasive → mixed
**Duration**: 1M steps
**Advancement gate**: >55% win rate per stage

---

## 5. Results

### 5.1 Phase 1 Results

| Metric | Value |
|---|---|
| Timesteps | 500,352 |
| Final reward (train) | 237.0 |
| FPS | 52 (A10G) |
| Wall time | ~2h 15min |
| Win rate (eval) | **92.0%** |
| Avg reward (eval) | **180.1** |
| Survival rate | **100.0%** |

### 5.2 Phase 2 Results

| Metric | Value |
|---|---|
| Timesteps | 1,001,760 total (500,408 new) |
| FPS | 50 (A10G) |
| Wall time | ~2h 45min |
| Win rate (eval) | **93.0%** |
| Avg reward (eval) | **153.4** |
| Avg bombs | **20.1** |

---

## 6. Artifacts

| File | Purpose |
|---|---|
| `phase1_final.zip` | Phase 1 complete checkpoint |
| `phase2_final.zip` | Phase 2 complete checkpoint |
| `phase2_ckpt_*.zip` | Phase 2 intermediates (650k–1M) |
| `phase2_eval_results.txt` | Phase 2 evaluation metrics |
| `ae_manager.py` | Inference code |
| `docs/ae.md` | This documentation |

---
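The Phase 2 shaping mechanism (visit-count bonus with tanh annealing, §4.2) can be sketched as follows. The `1/(1+visits)` term is from the doc; how exactly `tanh(avg_enemy_deaths)` anneals the bonus is not specified, so the `1 - tanh(...)` scaling below is an assumption — the bonus fades out as the agent starts winning fights:

```python
import math
from collections import defaultdict

class VisitBonus:
    """Phase-2-style exploration shaping: per-cell bonus = scale / (1 + visits).

    Annealing is an assumed interpretation: the bonus is scaled by
    1 - tanh(avg_enemy_deaths), so it approaches zero once the agent
    reliably eliminates opponents and no longer needs the shaping signal.
    """

    def __init__(self, base_scale: float = 1.0):
        self.base_scale = base_scale
        self.visits = defaultdict(int)  # (row, col) -> visit count

    def step(self, cell: tuple, avg_enemy_deaths: float) -> float:
        self.visits[cell] += 1
        anneal = 1.0 - math.tanh(avg_enemy_deaths)  # -> 0 as kills rise
        return self.base_scale * anneal / (1 + self.visits[cell])

bonus = VisitBonus()
first = bonus.step((3, 4), avg_enemy_deaths=0.0)   # fresh cell, no annealing yet
second = bonus.step((3, 4), avg_enemy_deaths=0.0)  # revisiting pays less
```

This decaying-per-cell reward is what counters the base-camping exploit noted in §3.2: sitting still earns a shrinking bonus, while moving to unvisited cells keeps paying out.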
## 7. Next Steps

- [ ] Submit Phase 3 HF Job (`phase3_curriculum.py`)
- [ ] Monitor 5-stage curriculum progression
- [ ] Evaluate final model vs mixed rule-based opponents
- [ ] Future: CNN policy, opponent modeling, LSTM memory

*Last updated: 2026-05-14 — Phase 2 complete, Phase 3 ready*