# TIL-26-AE: Automated Exploration Bomberman Agent
**Repository**: `E-Rong/til-26-ae-agent`
**Challenge**: The Intelligent League (TIL) — Automated Exploration (AE)
**Base Environment**: `e-rong/til-26-ae` Space
**Model Repo**: `E-Rong/til-26-ae-agent` (checkpoints + inference code)
---
## Table of Contents
1. [Research & Literature Review](#1-research--literature-review)
2. [Problem Analysis](#2-problem-analysis)
3. [Development Decisions](#3-development-decisions)
4. [Training Phases](#4-training-phases)
5. [Results](#5-results)
6. [Artifacts](#6-artifacts)
7. [Next Steps](#7-next-steps)
---
## 1. Research & Literature Review
### 1.1 Domain: Multi-Agent Bomberman RL
The TIL-26-AE challenge is a multi-agent Bomberman-like environment where agents navigate a grid, collect resources, place bombs, destroy walls, and eliminate opponents. The key challenge is **autonomous exploration**.
### 1.2 Key Papers
| Paper | arXiv ID | Key Insight | Relevance |
|---|---|---|---|
| *Pommerman: A Multi-Agent Benchmark* | 2407.00662 | PettingZoo + parallel env standard | Confirmed approach |
| *MAPPO* | 2103.01955 | Shared parameters, curriculum | Justified curriculum |
| *Invalid Action Masking* | 2006.14171 | Masks logits before softmax | **Directly applicable** |
| *PPO Algorithms* | 1707.06347 | Clipped surrogate, stable | Chosen over DQN |
### 1.3 Why MaskablePPO?
Bomberman agents cannot move into walls, step out of bounds, or place bombs with an empty stockpile. The observation includes `action_mask: uint8[6]`. Standard PPO would waste roughly 30-40% of samples on illegal moves; MaskablePPO masks the logits before the softmax, ensuring only legal actions are sampled.
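A minimal NumPy sketch of the masking idea (MaskablePPO applies this to the policy logits internally; the values here are illustrative):

```python
import numpy as np

def masked_softmax(logits: np.ndarray, action_mask: np.ndarray) -> np.ndarray:
    """Drive illegal-action logits to -inf so softmax assigns them zero probability."""
    masked = np.where(action_mask.astype(bool), logits, -np.inf)
    exp = np.exp(masked - masked.max())  # subtract max for numerical stability
    return exp / exp.sum()

logits = np.array([1.2, 0.3, -0.5, 2.0, 0.0, 1.1])   # 6 actions
mask = np.array([1, 1, 0, 1, 1, 0], dtype=np.uint8)  # two illegal actions masked out
probs = masked_softmax(logits, mask)
assert probs[2] == 0.0 and probs[5] == 0.0  # illegal actions are never sampled
```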
### 1.4 Why Curriculum Learning?
Training against strong opponents from scratch leads to catastrophic early losses (~0 reward). Curriculum learning (easy → hard) is standard in competitive multi-agent RL.
### 1.5 Why Not DQN?
DQN struggles with action masking (it requires a custom architecture). PPO's on-policy updates better handle the non-stationarity of multi-agent self-play, and PPO has mature masking support in `sb3-contrib`.
---
## 2. Problem Analysis
### 2.1 Environment Structure
- **Grid size**: 16×16
- **Agents**: Configurable (default 2 teams, Phase 3 uses 3)
- **Observations**: Dict with `agent_viewcone[7×5×25]`, `base_viewcone[5×5×25]`, direction, location, health, `action_mask[6]`, etc.
- **Actions**: Discrete(6) — FORWARD, BACKWARD, LEFT, RIGHT, STAY, PLACE_BOMB
- **Episode length**: ~200 steps
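A sketch of these spaces in `gymnasium` terms. Only the shapes and counts above come from the environment spec; the bounds, dtypes, and exact scalar keys are assumptions:

```python
import numpy as np
from gymnasium import spaces

# Shapes follow the spec above; bounds and dtypes are illustrative assumptions.
observation_space = spaces.Dict({
    "agent_viewcone": spaces.Box(0, 255, shape=(7, 5, 25), dtype=np.uint8),
    "base_viewcone":  spaces.Box(0, 255, shape=(5, 5, 25), dtype=np.uint8),
    "direction":      spaces.Discrete(4),
    "location":       spaces.Box(0, 15, shape=(2,), dtype=np.uint8),  # 16x16 grid
    "health":         spaces.Box(0, 100, shape=(1,), dtype=np.uint8),
    "action_mask":    spaces.MultiBinary(6),
})
action_space = spaces.Discrete(6)  # FORWARD, BACKWARD, LEFT, RIGHT, STAY, PLACE_BOMB
```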
### 2.2 Observation Flattening
The Dict observation is flattened to a **1511-dim vector**: `agent_viewcone` (7×5×25 = 875) + `base_viewcone` (5×5×25 = 625) + 11 scalar features.
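A minimal flattening sketch (875 + 625 + 11 = 1511). The `scalars` key is a stand-in for the 11 scalar features (direction, location, health, etc.); their exact packing is project-specific:

```python
import numpy as np

def flatten_obs(obs: dict) -> np.ndarray:
    """Concatenate viewcones and scalar features into one float32 vector."""
    vec = np.concatenate([
        np.asarray(obs["agent_viewcone"], dtype=np.float32).ravel(),  # 7*5*25 = 875
        np.asarray(obs["base_viewcone"],  dtype=np.float32).ravel(),  # 5*5*25 = 625
        np.asarray(obs["scalars"],        dtype=np.float32).ravel(),  # 11 scalars
    ])
    assert vec.shape == (1511,)
    return vec
```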
### 2.3 Action Masking
Critical bug found: `Monitor` must wrap *outside* `ActionMasker`, not inside. Otherwise `get_action_masks()` fails because `Monitor` does not expose `action_masks()`.
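The fixed ordering, sketched with the `sb3-contrib` wrapper. Here `base_env` is the single-agent wrapper from §3.1, and the `mask_fn` body uses a hypothetical accessor; in this project the mask ultimately comes from the observation's `action_mask` field:

```python
import numpy as np
from stable_baselines3.common.monitor import Monitor
from sb3_contrib.common.wrappers import ActionMasker

def mask_fn(env) -> np.ndarray:
    # Hypothetical accessor: the base env caches the last observed action_mask.
    return np.asarray(env.last_action_mask, dtype=bool)

env = Monitor(ActionMasker(base_env, mask_fn))   # correct: Monitor wraps outside
# env = ActionMasker(Monitor(base_env), mask_fn) # wrong: get_action_masks() breaks
```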
---
## 3. Development Decisions
### 3.1 Single-Agent Wrapper
The wrapper controls only `agent_0`; opponents follow random (Phases 1-2) or rule-based (Phase 3) policies. This reduces the problem to single-agent RL in a non-stationary environment.
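A sketch of the reduction, assuming a PettingZoo parallel env and a scripted `opponent_policy` (the names here are illustrative, not the repo's actual classes):

```python
import gymnasium as gym

class SingleAgentWrapper(gym.Env):
    """Drive agent_0 with the learned policy; script everyone else."""

    def __init__(self, parallel_env, opponent_policy):
        self.penv = parallel_env
        self.opponent_policy = opponent_policy
        self.observation_space = parallel_env.observation_space("agent_0")
        self.action_space = parallel_env.action_space("agent_0")

    def reset(self, *, seed=None, options=None):
        self._obs, infos = self.penv.reset(seed=seed)
        return self._obs["agent_0"], infos.get("agent_0", {})

    def step(self, action):
        # Learned action for agent_0, scripted actions for all other agents.
        acts = {a: self.opponent_policy(self._obs[a])
                for a in self.penv.agents if a != "agent_0"}
        acts["agent_0"] = action
        self._obs, rew, term, trunc, infos = self.penv.step(acts)
        return (self._obs["agent_0"], rew["agent_0"],
                term["agent_0"], trunc["agent_0"], infos.get("agent_0", {}))
```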
### 3.2 3-Phase Curriculum
| Phase | Opponent | Duration | Purpose |
|---|---|---|---|
| **1** | Random | 500k | Learn movement, bombs, basics |
| **2** | Random + exploration bonus | 500k | Prevent camping exploit |
| **3** | Rule-based curriculum | 1M | Generalize to structured opponents |
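The same schedule as a config sketch (the phase and opponent names are illustrative labels, not the repo's actual identifiers):

```python
# (name, opponent, timesteps)
PHASES = [
    ("phase1_foundation",  "random",                  500_000),
    ("phase2_exploration", "random+explore_bonus",    500_000),
    ("phase3_curriculum",  "rule_based_curriculum", 1_000_000),
]
```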
### 3.3 Philosophy
- `stable-baselines3` for PPO core
- `sb3-contrib` for MaskablePPO + ActionMasker
- `huggingface_hub` for persistent checkpoint storage
### 3.4 Why Hub Every 50k Steps
Sandbox resets (T4 container recycling) wiped the local `/app/data/` directory multiple times. Hub checkpointing saved the project at the 400k-step mark when training crashed.
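A sketch of such a callback with `stable-baselines3` and `huggingface_hub` (the paths are illustrative, and this is not necessarily the repo's exact callback):

```python
from huggingface_hub import upload_file
from stable_baselines3.common.callbacks import BaseCallback

class HubCheckpointCallback(BaseCallback):
    """Save and push a checkpoint every `save_freq` timesteps."""

    def __init__(self, save_freq: int = 50_000,
                 repo_id: str = "E-Rong/til-26-ae-agent"):
        super().__init__()
        self.save_freq = save_freq
        self.repo_id = repo_id
        self._last_save = 0

    def _on_step(self) -> bool:
        if self.num_timesteps - self._last_save >= self.save_freq:
            self._last_save = self.num_timesteps
            path = f"/tmp/ckpt_{self.num_timesteps}.zip"
            self.model.save(path)  # Hub copy survives even if /app/data/ is wiped
            upload_file(path_or_fileobj=path,
                        path_in_repo=f"ckpt_{self.num_timesteps}.zip",
                        repo_id=self.repo_id)
        return True
```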
---
## 4. Training Phases
### 4.1 Phase 1: Foundation (vs Random)
**Duration**: 500,352 steps
**Result**: Win rate 92%, avg reward 180.1, 100% survival
**Challenges**: Wrapper ordering, dependency issues, sandbox resets
### 4.2 Phase 2: Exploration Shaping (IN PROGRESS)
**Status**: Started at 500,352 steps, running on an A10G at ~54 FPS
**Mechanism**: Visit-count bonus = 1/(1 + visits), adaptively annealed via tanh(avg_enemy_deaths); see the sketch below
**ETA**: ~2.5 hours, targets 1,000,352 total steps
**Purpose**: Force map exploration, prevent safe base-camping
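A sketch of the shaping term with the stated `k = 1.2`. The annealing direction (fading the bonus as combat performance rises) is an assumption consistent with the stated purpose:

```python
import math
from collections import defaultdict

class ExplorationBonus:
    """Visit-count shaping: bonus = k * anneal / (1 + visits[cell])."""

    def __init__(self, k: float = 1.2):  # k = 1.2 per the Phase 2 config
        self.k = k
        self.visits = defaultdict(int)

    def bonus(self, cell: tuple, avg_enemy_deaths: float) -> float:
        self.visits[cell] += 1
        anneal = 1.0 - math.tanh(avg_enemy_deaths)  # fade as the agent learns to fight
        return self.k * anneal / (1.0 + self.visits[cell])
```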
### 4.3 Phase 3: Curriculum Self-Play
**Pending**: Rule-based static → simple → smart → mixed, 3 teams, 1M steps
---
## 5. Results
### 5.1 Phase 1 Results
| Metric | Value |
|---|---|
| Timesteps | 500,352 |
| Final Reward | 237.0 |
| FPS | 52 (A10G) |
| Wall time | ~2h 15min |
| Win Rate (eval) | **92.0%** |
| Avg Reward (eval) | **180.1** |
| Survival Rate | **100.0%** |
### 5.2 Phase 2 Interim (Early)
| Metric | Value |
|---|---|
| Starting Step | 500,352 |
| Initial Reward (shaped) | 210 |
| FPS | 54 |
| Explore Weight | Adaptive k=1.2 |
---
## 6. Artifacts
| File | Purpose |
|---|---|
| `phase1_final.zip` | Trained model |
| `phase2_final.zip` | *(in progress)* |
| `ckpt_50000-400000.zip` | Phase 1 intermediates |
| `ae_manager.py` | Inference code |
| `docs/ae.md` | This documentation |
---
## 7. Next Steps
- **Phase 2**: Complete 500k exploration-shaping steps
- **Phase 3**: Curriculum vs rule-based opponents (1M steps)
- **Eval**: Multi-team evaluation vs smart opponents
- **Future**: CNN policy, opponent modeling, LSTM memory
*Last updated: 2026-05-14 — Phase 2 in progress*