# TIL-26-AE: Automated Exploration Bomberman Agent
**Repository**: `E-Rong/til-26-ae-agent`
**Challenge**: The Intelligent League (TIL) – Automated Exploration (AE)
**Base Environment**: `e-rong/til-26-ae` Space
**Model Repo**: `E-Rong/til-26-ae-agent` (checkpoints + inference code)
---
## Table of Contents
1. [Research & Literature Review](#1-research--literature-review)
2. [Problem Analysis](#2-problem-analysis)
3. [Development Decisions](#3-development-decisions)
4. [Training Phases](#4-training-phases)
5. [Results](#5-results)
6. [Artifacts](#6-artifacts)
7. [Next Steps](#7-next-steps)
---
## 1. Research & Literature Review
### 1.1 Domain: Multi-Agent Bomberman RL
The TIL-26-AE challenge is a multi-agent Bomberman-like environment where agents navigate a grid, collect resources, place bombs, destroy walls, and eliminate opponents. The key challenge is **autonomous exploration**.
### 1.2 Key Papers
| Paper | arXiv ID | Key Insight | Relevance |
|---|---|---|---|
| *Pommerman: A Multi-Agent Benchmark* | 2407.00662 | PettingZoo + parallel env standard | Confirmed approach |
| *MAPPO* | 2103.01955 | Shared parameters, curriculum | Justified curriculum |
| *Invalid Action Masking* | 2006.14171 | Masks logits before softmax | **Directly applicable** |
| *PPO Algorithms* | 1707.06347 | Clipped surrogate, stable | Chosen over DQN |
### 1.3 Why MaskablePPO?
Bomberman agents cannot move into walls, step out of bounds, or place bombs when their stockpile is empty. The observation includes `action_mask: uint8[6]`. Standard PPO would waste ~30-40% of samples on illegal moves; MaskablePPO masks the logits before the softmax, ensuring only legal actions are sampled.
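The mechanism can be sketched in a few lines of plain Python (this is the idea, not the sb3-contrib internals): logits of illegal actions are replaced with `-inf`, so the softmax assigns them exactly zero probability.

```python
import math

def masked_softmax(logits, action_mask):
    """Invalid-action masking sketch: set illegal logits to -inf before
    the softmax so illegal actions get exactly zero probability."""
    masked = [l if m else float("-inf") for l, m in zip(logits, action_mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in masked]
    total = sum(exps)
    return [e / total for e in exps]
```

MaskablePPO applies the same idea inside the policy's action distribution, so both sampled actions and their log-probabilities respect the mask.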
### 1.4 Why Curriculum Learning?
Training against strong opponents from scratch leads to catastrophic early losses (~0 reward). Curriculum learning (easy → hard) is standard in competitive multi-agent RL.
### 1.5 Why Not DQN?
DQN handles action masking poorly (it requires a custom architecture), whereas PPO's on-policy updates cope better with the non-stationarity of multi-agent self-play, and PPO has mature masking support in `sb3-contrib`.
---
## 2. Problem Analysis
### 2.1 Environment Structure
- **Grid size**: 16×16
- **Agents**: Configurable (default 2 teams; Phase 3 uses 3)
- **Observations**: Dict with `agent_viewcone[7×5×25]`, `base_viewcone[5×5×25]`, direction, location, health, `action_mask[6]`, etc.
- **Actions**: Discrete(6) – FORWARD, BACKWARD, LEFT, RIGHT, STAY, PLACE_BOMB
- **Episode length**: ~200 steps
### 2.2 Observation Flattening
Flattened to a **1511-dim vector**: agent_viewcone (7×5×25 = 875) + base_viewcone (5×5×25 = 625) + 11 scalars.
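A pure-Python sketch of the flattening (field names are illustrative; `scalars` stands in for the 11 scalar entries such as direction, location, and health):

```python
def flatten_obs(obs):
    """Flatten the dict observation into a single 1511-entry vector:
    agent_viewcone (7*5*25 = 875) + base_viewcone (5*5*25 = 625) + 11 scalars."""
    flat = []
    for key in ("agent_viewcone", "base_viewcone"):
        for plane in obs[key]:      # walk the 3-D grid in row-major order
            for row in plane:
                flat.extend(row)
    flat.extend(obs["scalars"])     # direction, location, health, ...
    return flat
```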
### 2.3 Action Masking
Critical bug found: `Monitor` must wrap *outside* `ActionMasker`, not inside. Otherwise `get_action_masks()` fails because `Monitor` does not expose `action_masks()`.
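The ordering rule can be illustrated with minimal stand-ins (these are not the real `Monitor`/`ActionMasker` classes, and sb3-contrib's actual lookup path differs in detail): a gym-style wrapper forwards unknown attributes inward, so with `Monitor` outermost the `action_masks()` lookup still falls through to the `ActionMasker` beneath it.

```python
class ForwardingWrapper:
    """Stand-in for a gym-style wrapper: unknown attributes fall
    through to the wrapped env via __getattr__."""
    def __init__(self, env):
        self.env = env
    def __getattr__(self, name):
        return getattr(self.env, name)

class ActionMasker(ForwardingWrapper):
    def action_masks(self):
        return [1, 1, 0, 1, 1, 0]  # illustrative mask for Discrete(6)

class Monitor(ForwardingWrapper):
    pass  # records episode stats; defines no action_masks() of its own

class BaseEnv:
    pass

# Correct order: Monitor wraps *outside* ActionMasker, and the
# attribute lookup falls through Monitor to reach action_masks().
env = Monitor(ActionMasker(BaseEnv()))
print(env.action_masks())  # [1, 1, 0, 1, 1, 0]
```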
---
## 3. Development Decisions
### 3.1 Single-Agent Wrapper
The wrapper controls only `agent_0`; opponents follow random (Phases 1-2) or rule-based (Phase 3) policies. This reduces the task to single-agent RL in a non-stationary environment.
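A sketch of the reduction (stand-in classes, not the actual wrapper; the real env returns PettingZoo-style per-agent dicts):

```python
import random

class SingleAgentWrapper:
    """Expose a multi-agent parallel env as single-agent RL for `agent_0`;
    every other agent is driven by a random policy (the Phase 1-2 setup)."""

    def __init__(self, parallel_env, n_actions=6):
        self.env = parallel_env
        self.n_actions = n_actions

    def reset(self, seed=None):
        obs = self.env.reset(seed=seed)
        return obs["agent_0"]

    def step(self, action):
        # Build the joint action: random moves for opponents, the learned
        # action for agent_0, then project the per-agent dicts back down.
        joint = {a: random.randrange(self.n_actions) for a in self.env.agents}
        joint["agent_0"] = action
        obs, rew, done, info = self.env.step(joint)
        return obs["agent_0"], rew["agent_0"], done["agent_0"], info["agent_0"]
```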
### 3.2 3-Phase Curriculum
| Phase | Opponent | Duration | Purpose |
|---|---|---|---|
| **1** | Random | 500k | Learn movement, bombs, basics |
| **2** | Random + exploration bonus | 500k | Prevent camping exploit |
| **3** | Rule-based curriculum | 1M | Generalize to structured opponents |
### 3.3 Philosophy
- `stable-baselines3` for PPO core
- `sb3-contrib` for MaskablePPO + ActionMasker
- `huggingface_hub` for persistent checkpoint storage
### 3.4 Why Hub Every 50k Steps
Sandbox resets (T4 container recycling) caused local `/app/data/` loss multiple times. Hub checkpointing saved the project at 400k steps when training crashed.
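The trigger logic behind the every-50k-step uploads can be sketched as follows (the class name and `upload` hook are hypothetical; the real callback would push checkpoint files with `huggingface_hub`):

```python
class HubCheckpointCallback:
    """Upload whenever training has advanced >= `interval` steps since
    the last checkpoint. PPO rollouts rarely land exactly on a 50k
    boundary, so track the last save instead of using modulo."""

    def __init__(self, interval=50_000, upload=lambda step: None):
        self.interval = interval
        self.upload = upload          # stand-in for the Hub push
        self.last_saved = 0
        self.saved_steps = []

    def on_step(self, num_timesteps):
        if num_timesteps - self.last_saved >= self.interval:
            self.upload(num_timesteps)
            self.saved_steps.append(num_timesteps)
            self.last_saved = num_timesteps
```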
---
## 4. Training Phases
### 4.1 Phase 1: Foundation (vs Random)
**Duration**: 500,352 steps
**Result**: Win rate 92%, avg reward 180.1, 100% survival
**Challenges**: Wrapper ordering, dependency issues, sandbox resets
### 4.2 Phase 2: Exploration Shaping (COMPLETE)
**Duration**: 500,408 additional steps (600,352 → 1,001,760)
**Mechanism**: Visit-count bonus = 1/(1+visits), adaptive annealing via tanh(avg_enemy_deaths)
**Hardware**: A10G, ~50 FPS
**Wall time**: ~2h 45min
**Result**: Win rate 93.0%, avg reward 153.4, avg bombs 20.1
**Key insight**: Reward decreased (180 → 153) but win rate increased (92% → 93%), confirming exploration makes the policy more robust at the cost of safe base-camping reward.
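The Phase 2 shaping can be written out as follows (a sketch; the annealing direction, scaling the bonus down as average enemy deaths rise, is an assumption based on the description above):

```python
import math

def exploration_bonus(visits, avg_enemy_deaths):
    """Visit-count novelty bonus, annealed away as combat skill improves."""
    novelty = 1.0 / (1.0 + visits)              # 1 for a new cell, decaying with revisits
    anneal = 1.0 - math.tanh(avg_enemy_deaths)  # -> 0 once kills become routine
    return novelty * anneal
```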
### 4.3 Phase 3: Curriculum Self-Play (PENDING)
**Script**: `phase3_curriculum.py` (ready on Hub)
**Plan**: 5-stage rule-based curriculum – static → random → simple_bomb → evasive → mixed
**Duration**: 1M steps
**Advancement gate**: >55% win rate per stage
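The advancement gate can be sketched as a simple windowed win-rate check (function name and the 100-episode window are illustrative):

```python
STAGES = ["static", "random", "simple_bomb", "evasive", "mixed"]

def next_stage(stage_idx, recent_wins, gate=0.55, window=100):
    """Advance one stage once the win rate over the last `window`
    episodes exceeds the gate; stay put at the final stage."""
    recent = recent_wins[-window:]
    if len(recent) >= window and sum(recent) / len(recent) > gate:
        return min(stage_idx + 1, len(STAGES) - 1)
    return stage_idx
```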
---
## 5. Results
### 5.1 Phase 1 Results
| Metric | Value |
|---|---|
| Timesteps | 500,352 |
| Final Reward | 237.0 |
| FPS | 52 (A10G) |
| Wall time | ~2h 15min |
| Win Rate (eval) | **92.0%** |
| Avg Reward (eval) | **180.1** |
| Survival Rate | **100.0%** |
### 5.2 Phase 2 Results
| Metric | Value |
|---|---|
| Timesteps | 1,001,760 total (500,408 new) |
| FPS | 50 (A10G) |
| Wall time | ~2h 45min |
| Win Rate (eval) | **93.0%** |
| Avg Reward (eval) | **153.4** |
| Avg Bombs | **20.1** |
---
## 6. Artifacts
| File | Purpose |
|---|---|
| `phase1_final.zip` | Phase 1 complete checkpoint |
| `phase2_final.zip` | Phase 2 complete checkpoint |
| `phase2_ckpt_*.zip` | Phase 2 intermediates (650k–1M) |
| `phase2_eval_results.txt` | Phase 2 evaluation metrics |
| `ae_manager.py` | Inference code |
| `docs/ae.md` | This documentation |
---
## 7. Next Steps
- [ ] Submit Phase 3 HF Job (`phase3_curriculum.py`)
- [ ] Monitor 5-stage curriculum progression
- [ ] Evaluate final model vs mixed rule-based opponents
- [ ] Future: CNN policy, opponent modeling, LSTM memory
*Last updated: 2026-05-14 β Phase 2 complete, Phase 3 ready*