# TIL-26-AE: Automated Exploration Bomberman Agent
**Repository**: `E-Rong/til-26-ae-agent`
**Challenge**: The Intelligent League (TIL), Automated Exploration (AE)
**Base Environment**: `e-rong/til-26-ae` Space
**Model Repo**: `E-Rong/til-26-ae-agent` (checkpoints + inference code)
---
## Table of Contents
1. [Research & Literature Review](#1-research--literature-review)
2. [Problem Analysis](#2-problem-analysis)
3. [Development Decisions](#3-development-decisions)
4. [Training Phases](#4-training-phases)
5. [Results](#5-results)
6. [Artifacts](#6-artifacts)
7. [Next Steps](#7-next-steps)
---
## 1. Research & Literature Review
### 1.1 Domain: Multi-Agent Bomberman RL
The TIL-26-AE challenge is a multi-agent Bomberman-like environment where agents navigate a grid, collect resources, place bombs, destroy walls, and eliminate opponents. The key challenge is **autonomous exploration**.
### 1.2 Key Papers
| Paper | arXiv ID | Key Insight | Relevance |
|---|---|---|---|
| *Pommerman: A Multi-Agent Benchmark* | 2407.00662 | PettingZoo + parallel env standard | Confirmed approach |
| *MAPPO* | 2103.01955 | Shared parameters, curriculum | Justified curriculum |
| *Invalid Action Masking* | 2006.14171 | Masks logits before softmax | **Directly applicable** |
| *PPO Algorithms* | 1707.06347 | Clipped surrogate, stable | Chosen over DQN |
### 1.3 Why MaskablePPO?
Bomberman agents cannot move into walls or out of bounds, and cannot place a bomb without one in their stockpile. The observation therefore includes `action_mask: uint8[6]`. Standard PPO would waste roughly 30-40% of samples on illegal moves; MaskablePPO masks the logits before the softmax, so only legal actions are ever sampled (see the sketch below).
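A minimal sketch of the trick in plain PyTorch (not `sb3-contrib`'s actual internals): illegal logits are pushed to a large negative value, so the softmax gives them near-zero probability.

```python
import torch

def mask_logits(logits: torch.Tensor, action_mask: torch.Tensor) -> torch.Tensor:
    # Push illegal actions to a huge negative logit so softmax assigns
    # them ~zero probability and they can never be sampled.
    return logits.masked_fill(~action_mask.bool(), -1e8)

logits = torch.randn(6)                  # raw policy-head output
mask = torch.tensor([1, 1, 0, 1, 0, 1])  # `action_mask` field from the observation
probs = torch.softmax(mask_logits(logits, mask), dim=-1)
print(probs)  # actions 2 and 4 get ~0 probability
```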
### 1.4 Why Curriculum Learning?
Training against strong opponents from scratch leads to catastrophic early losses (~0 reward). Curriculum learning (easy → hard) is standard practice in competitive multi-agent RL.
### 1.5 Why Not DQN?
DQN struggles with action masking (it requires a custom architecture). PPO's on-policy updates handle the non-stationarity of multi-agent self-play better, and mature masking support exists in `sb3-contrib` (MaskablePPO).
---
## 2. Problem Analysis
### 2.1 Environment Structure
- **Grid size**: 16×16
- **Agents**: Configurable (default 2 teams, Phase 3 uses 3)
- **Observations**: Dict with `agent_viewcone[7×5×25]`, `base_viewcone[5×5×25]`, direction, location, health, `action_mask[6]`, etc. (see the sketch after this list)
- **Actions**: Discrete(6): FORWARD, BACKWARD, LEFT, RIGHT, STAY, PLACE_BOMB
- **Episode length**: ~200 steps
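For reference, a sketch of these spaces in Gymnasium terms. Key names and shapes match the list above, but the dtypes and value ranges are assumptions, not the env's exact definition:

```python
import numpy as np
from gymnasium import spaces

observation_space = spaces.Dict({
    "agent_viewcone": spaces.Box(0, 255, shape=(7, 5, 25), dtype=np.uint8),
    "base_viewcone":  spaces.Box(0, 255, shape=(5, 5, 25), dtype=np.uint8),
    "direction":      spaces.Discrete(4),                                # assumed 4 headings
    "location":       spaces.Box(0, 15, shape=(2,), dtype=np.int64),     # (x, y) on the 16x16 grid
    "health":         spaces.Box(0, 100, shape=(1,), dtype=np.float32),  # assumed range
    "action_mask":    spaces.MultiBinary(6),
})
action_space = spaces.Discrete(6)  # FORWARD, BACKWARD, LEFT, RIGHT, STAY, PLACE_BOMB
```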
### 2.2 Observation Flattening
The Dict observation is flattened to a **1511-dim vector**: agent_viewcone (875) + base_viewcone (625) + 11 scalars.
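A sketch of the flattening, assuming the key names above; the exact scalar list and ordering live in the training wrapper, so the subset shown here is illustrative only:

```python
import numpy as np

def flatten_obs(obs: dict) -> np.ndarray:
    # Viewcones contribute 7*5*25 = 875 and 5*5*25 = 625 entries.
    viewcones = [
        obs["agent_viewcone"].astype(np.float32).ravel(),
        obs["base_viewcone"].astype(np.float32).ravel(),
    ]
    # The remaining 11 entries come from the scalar fields; the keys
    # below are a hypothetical subset of the real 11-value list.
    scalars = np.hstack([
        np.asarray(obs[k], dtype=np.float32).ravel()
        for k in ("direction", "location", "health")
    ])
    return np.concatenate(viewcones + [scalars])  # 1511-dim with all 11 scalars
```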
### 2.3 Action Masking
Critical bug found: `Monitor` must wrap *outside* `ActionMasker`, not inside; otherwise `get_action_masks()` fails because `Monitor` does not itself expose an `action_masks()` method.
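The correct ordering, sketched below. How the mask is read from the raw env is project-specific; `last_action_mask` is a hypothetical stand-in for that accessor:

```python
import numpy as np
from sb3_contrib.common.wrappers import ActionMasker
from stable_baselines3.common.monitor import Monitor

def mask_fn(env) -> np.ndarray:
    # `last_action_mask` is a placeholder for however the raw env
    # exposes the current `action_mask` observation field.
    return np.asarray(env.unwrapped.last_action_mask, dtype=bool)

def make_masked_env(raw_env):
    """raw_env: the flattened single-agent env from Section 3.1."""
    env = ActionMasker(raw_env, mask_fn)  # ActionMasker on the inside...
    return Monitor(env)                   # ...Monitor wrapping outside
```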
---
## 3. Development Decisions
### 3.1 Single-Agent Wrapper
The wrapper controls only `agent_0`; opponents follow random (Phases 1-2) or rule-based (Phase 3) policies. This reduces the problem to single-agent RL in a non-stationary environment (sketched below).
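A minimal sketch of such a wrapper over the PettingZoo parallel API; class and attribute names are illustrative, not the repo's actual code:

```python
import gymnasium as gym

class SingleAgentWrapper(gym.Env):
    """Drive `agent_0` with the learner; all other agents follow `opponent_policy`."""

    def __init__(self, parallel_env, opponent_policy):
        self.env = parallel_env
        self.opponent_policy = opponent_policy
        self.observation_space = parallel_env.observation_space("agent_0")
        self.action_space = parallel_env.action_space("agent_0")

    def reset(self, *, seed=None, options=None):
        obs, infos = self.env.reset(seed=seed)
        self._last_obs = obs
        return obs["agent_0"], infos.get("agent_0", {})

    def step(self, action):
        # Opponents act from their own observations (random in Phases 1-2,
        # rule-based in Phase 3); only agent_0's transition is returned.
        actions = {a: self.opponent_policy(self._last_obs[a])
                   for a in self.env.agents if a != "agent_0"}
        actions["agent_0"] = action
        obs, rew, term, trunc, infos = self.env.step(actions)
        self._last_obs = obs
        return (obs["agent_0"], rew["agent_0"], term["agent_0"],
                trunc["agent_0"], infos.get("agent_0", {}))
```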
### 3.2 3-Phase Curriculum
| Phase | Opponent | Steps | Purpose |
|---|---|---|---|
| **1** | Random | 500k | Learn movement, bombs, basics |
| **2** | Random + exploration bonus | 500k | Prevent camping exploit |
| **3** | Rule-based curriculum | 1M | Generalize to structured opponents |
### 3.3 Philosophy
- `stable-baselines3` for PPO core
- `sb3-contrib` for MaskablePPO + ActionMasker
- `huggingface_hub` for persistent checkpoint storage
### 3.4 Why Hub Every 50k Steps
Sandbox resets (T4 container recycling) wiped local `/app/data/` multiple times; checkpointing to the Hub saved the project when training crashed at 400k steps (see the callback sketch below).
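A sketch of the checkpoint callback using `huggingface_hub.upload_file` inside an SB3 callback; the class name and paths are illustrative, not the repo's actual code:

```python
from huggingface_hub import upload_file
from stable_baselines3.common.callbacks import BaseCallback

class HubCheckpointCallback(BaseCallback):
    def __init__(self, repo_id: str, save_freq: int = 50_000):
        super().__init__()
        self.repo_id = repo_id
        self.save_freq = save_freq

    def _on_step(self) -> bool:
        # Assumes num_timesteps lands exactly on multiples of save_freq
        # (true for a single env); the real callback may track an offset.
        if self.num_timesteps % self.save_freq == 0:
            path = f"/tmp/ckpt_{self.num_timesteps}.zip"
            self.model.save(path)
            upload_file(path_or_fileobj=path,
                        path_in_repo=f"ckpt_{self.num_timesteps}.zip",
                        repo_id=self.repo_id)
        return True

# usage: model.learn(500_000, callback=HubCheckpointCallback("E-Rong/til-26-ae-agent"))
```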
---
## 4. Training Phases
### 4.1 Phase 1: Foundation (vs Random)
**Duration**: 500,352 steps
**Result**: Win rate 92%, avg reward 180.1, 100% survival
**Challenges**: Wrapper ordering, dependency issues, sandbox resets
### 4.2 Phase 2: Exploration Shaping (IN PROGRESS)
**Status**: Resumed from 500,352 steps, running on an A10G at ~54 FPS
**Mechanism**: Visit-count bonus of 1/(1+visits), annealed adaptively via tanh(avg_enemy_deaths); see the sketch below
**ETA**: ~2.5 hours, targets 1,000,352 total steps
**Purpose**: Force map exploration, prevent safe base-camping
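A sketch of the shaping term. It assumes the two pieces combine multiplicatively and that the k=1.2 reported in Section 5.2 sits inside the tanh; both placements are assumptions, since the doc only names the two terms:

```python
import numpy as np

def exploration_bonus(visits: np.ndarray, cell: tuple,
                      avg_enemy_deaths: float, k: float = 1.2) -> float:
    # Count-based novelty: a fresh cell pays 1.0, repeats decay as 1/(1+n).
    novelty = 1.0 / (1.0 + visits[cell])
    # Adaptive annealing: as avg_enemy_deaths grows, tanh saturates and the
    # bonus fades, handing the reward signal back to combat/objectives.
    anneal = 1.0 - np.tanh(k * avg_enemy_deaths)
    return float(novelty * anneal)

visits = np.zeros((16, 16), dtype=np.int32)
print(exploration_bonus(visits, (3, 7), avg_enemy_deaths=0.0))  # 1.0: unvisited cell, no kills yet
```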
### 4.3 Phase 3: Curriculum Self-Play
**Pending**: Rule-based curriculum (static → simple → smart → mixed), 3 teams, 1M steps
---
## 5. Results
### 5.1 Phase 1 Results
| Metric | Value |
|---|---|
| Timesteps | 500,352 |
| Final Reward | 237.0 |
| FPS | 52 (A10G) |
| Wall time | ~2h 15min |
| Win Rate (eval) | **92.0%** |
| Avg Reward (eval) | **180.1** |
| Survival Rate | **100.0%** |
### 5.2 Phase 2 Interim (Early)
| Metric | Value |
|---|---|
| Starting Step | 500,352 |
| Initial Reward (shaped) | 210 |
| FPS | 54 |
| Explore Weight | Adaptive k=1.2 |
---
## 6. Artifacts
| File | Purpose |
|---|---|
| `phase1_final.zip` | Final Phase 1 model |
| `phase2_final.zip` | Final Phase 2 model *(in progress)* |
| `ckpt_50000-400000.zip` | Phase 1 intermediates |
| `ae_manager.py` | Inference code |
| `docs/ae.md` | This documentation |
---
## 7. Next Steps
- **Phase 2**: Complete 500k exploration-shaping steps
- **Phase 3**: Curriculum vs rule-based opponents (1M steps)
- **Eval**: Multi-team evaluation vs smart opponents
- **Future**: CNN policy, opponent modeling, LSTM memory
*Last updated: 2026-05-14 (Phase 2 in progress)*