# TIL-26-AE: Automated Exploration Bomberman Agent

**Repository**: `E-Rong/til-26-ae-agent`
**Challenge**: The Intelligent League (TIL) — Automated Exploration (AE)
**Base Environment**: `e-rong/til-26-ae` Space
**Model Repo**: `E-Rong/til-26-ae-agent` (checkpoints + inference code)

---

## Table of Contents

1. [Research & Literature Review](#1-research--literature-review)
2. [Problem Analysis](#2-problem-analysis)
3. [Development Decisions](#3-development-decisions)
4. [Training Phases](#4-training-phases)
5. [Results](#5-results)
6. [Artifacts](#6-artifacts)
7. [Next Steps](#7-next-steps)

---

## 1. Research & Literature Review

### 1.1 Domain: Multi-Agent Bomberman RL

The TIL-26-AE challenge is a multi-agent Bomberman-like environment where agents navigate a grid, collect resources, place bombs, destroy walls, and eliminate opponents. The key challenge is **autonomous exploration**.

### 1.2 Key Papers

| Paper | arXiv ID | Key Insight | Relevance |
|---|---|---|---|
| *Pommerman: A Multi-Agent Benchmark* | 2407.00662 | PettingZoo + parallel env standard | Confirmed approach |
| *MAPPO* | 2103.01955 | Shared parameters, curriculum | Justified curriculum |
| *Invalid Action Masking* | 2006.14171 | Masks logits before softmax | **Directly applicable** |
| *PPO Algorithms* | 1707.06347 | Clipped surrogate, stable | Chosen over DQN |

### 1.3 Why MaskablePPO?

Bomberman agents cannot move into walls, step out of bounds, or place bombs when their stockpile is empty. The observation therefore includes `action_mask: uint8[6]`. Standard PPO would waste roughly 30-40% of its samples proposing illegal moves; MaskablePPO masks the logits before the softmax, ensuring only legal actions are ever sampled.
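
A minimal sketch of the masking trick (illustrative values, not the `sb3-contrib` internals):

```python
import numpy as np

def masked_softmax(logits: np.ndarray, action_mask: np.ndarray) -> np.ndarray:
    # Illegal actions get -inf logits, so softmax assigns them probability exactly 0.
    masked = np.where(action_mask.astype(bool), logits, -np.inf)
    z = np.exp(masked - masked.max())
    return z / z.sum()

logits = np.array([1.2, 0.3, -0.5, 2.0, 0.0, 1.5])
mask = np.array([1, 1, 0, 1, 1, 0], dtype=np.uint8)  # e.g. LEFT and PLACE_BOMB illegal here
print(masked_softmax(logits, mask))  # masked entries come out as 0.0
```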

### 1.4 Why Curriculum Learning?

Training against strong opponents from scratch leads to catastrophic early losses (~0 reward). Curriculum learning (easy → hard) is standard in competitive multi-agent RL.

### 1.5 Why Not DQN?

DQN handles action masking poorly (it requires a custom architecture). PPO's on-policy updates cope better with the non-stationarity of multi-agent self-play, and PPO has mature masking support in `sb3-contrib`.

---

## 2. Problem Analysis

### 2.1 Environment Structure

- **Grid size**: 16×16
- **Agents**: Configurable (default 2 teams; Phase 3 uses 3)
- **Observations**: Dict with `agent_viewcone[7×5×25]`, `base_viewcone[5×5×25]`, direction, location, health, `action_mask[6]`, etc.
- **Actions**: Discrete(6) — FORWARD, BACKWARD, LEFT, RIGHT, STAY, PLACE_BOMB
- **Episode length**: ~200 steps

### 2.2 Observation Flattening

The Dict observation is flattened to a **1511-dim vector**: agent_viewcone (875) + base_viewcone (625) + 11 scalar features.
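
A sketch of the flattening, assuming the key names from Section 2.1 (the exact scalar list is abbreviated here):

```python
import numpy as np

def flatten_obs(obs: dict) -> np.ndarray:
    # 875 + 625 + 11 scalar features = 1511
    viewcones = np.concatenate([
        obs["agent_viewcone"].ravel().astype(np.float32),  # 7 * 5 * 25 = 875
        obs["base_viewcone"].ravel().astype(np.float32),   # 5 * 5 * 25 = 625
    ])
    scalar_keys = ("direction", "location", "health")      # abbreviated; 11 values in total
    scalars = np.concatenate([np.atleast_1d(obs[k]).ravel().astype(np.float32)
                              for k in scalar_keys])
    return np.concatenate([viewcones, scalars])
```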

### 2.3 Action Masking

Critical bug found: `Monitor` must wrap *outside* `ActionMasker`, not inside. Otherwise `get_action_masks()` fails because `Monitor` does not expose `action_masks()`.
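
A sketch of the working order, assuming `raw_env` is the single-agent env from Section 3.1 (the `mask_fn` accessor is hypothetical; adapt it to however the env exposes its current mask):

```python
from sb3_contrib.common.wrappers import ActionMasker
from stable_baselines3.common.monitor import Monitor

def mask_fn(env):
    # Hypothetical accessor: read the current action_mask out of the last observation.
    return env.unwrapped.last_obs["action_mask"].astype(bool)

env = ActionMasker(raw_env, mask_fn)  # inner: provides action_masks()
env = Monitor(env)                    # outer: episode stats; never the other way round
```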

---

## 3. Development Decisions

### 3.1 Single-Agent Wrapper

The wrapper controls only `agent_0`; opponents follow random (Phases 1-2) or rule-based (Phase 3) policies. This reduces the task to single-agent RL in a non-stationary environment.
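
A minimal sketch of the idea, assuming a PettingZoo-style parallel env underneath (names and exact API details are illustrative):

```python
import gymnasium as gym

class SingleAgentWrapper(gym.Env):
    """Drive agent_0; all other agents act via a fixed opponent policy."""

    def __init__(self, parallel_env, opponent_policy):
        self.env = parallel_env                  # PettingZoo parallel env
        self.opponent_policy = opponent_policy   # fn(obs) -> action
        self.action_space = self.env.action_space("agent_0")
        self.observation_space = self.env.observation_space("agent_0")

    def reset(self, *, seed=None, options=None):
        obs, infos = self.env.reset(seed=seed)
        self._last_obs = obs
        return obs["agent_0"], infos.get("agent_0", {})

    def step(self, action):
        actions = {a: self.opponent_policy(self._last_obs[a])
                   for a in self.env.agents if a != "agent_0"}
        actions["agent_0"] = action
        obs, rew, term, trunc, infos = self.env.step(actions)
        self._last_obs = obs
        return (obs["agent_0"], rew["agent_0"], term["agent_0"],
                trunc["agent_0"], infos.get("agent_0", {}))
```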

### 3.2 3-Phase Curriculum

| Phase | Opponent | Duration | Purpose |
|---|---|---|---|
| **1** | Random | 500k | Learn movement, bombs, basics |
| **2** | Random + exploration bonus | 500k | Prevent camping exploit |
| **3** | Rule-based curriculum | 1M | Generalize to structured opponents |

### 3.3 Philosophy

- `stable-baselines3` for PPO core
- `sb3-contrib` for MaskablePPO + ActionMasker
- `huggingface_hub` for persistent checkpoint storage
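
A minimal sketch of how these pieces fit together (hyperparameters omitted; `env` is the `Monitor(ActionMasker(...))` stack from Section 2.3):

```python
from sb3_contrib import MaskablePPO

model = MaskablePPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=500_000)
model.save("phase1_final.zip")  # then pushed to the Hub (see Section 3.4)
```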

### 3.4 Why Hub Every 50k Steps

Sandbox resets (T4 container recycling) wiped local `/app/data/` multiple times. Hub checkpointing saved the project when training crashed at the 400k-step mark.
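
A sketch of the checkpoint callback (the exact callback used in the training scripts may differ; this shows the pattern):

```python
from huggingface_hub import HfApi
from stable_baselines3.common.callbacks import BaseCallback

class HubCheckpointCallback(BaseCallback):
    """Push a model checkpoint to the Hub every `save_freq` steps."""

    def __init__(self, repo_id: str, save_freq: int = 50_000):
        super().__init__()
        self.repo_id, self.save_freq = repo_id, save_freq
        self.api = HfApi()

    def _on_step(self) -> bool:
        if self.num_timesteps % self.save_freq == 0:
            path = f"/app/data/ckpt_{self.num_timesteps}.zip"
            self.model.save(path)
            # Local /app/data/ is ephemeral; the Hub copy survives container recycling.
            self.api.upload_file(path_or_fileobj=path,
                                 path_in_repo=f"ckpt_{self.num_timesteps}.zip",
                                 repo_id=self.repo_id)
        return True
```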

---

## 4. Training Phases

### 4.1 Phase 1: Foundation vs Random (COMPLETE)

**Duration**: 500,352 steps
**Result**: Win rate 92%, avg reward 180.1, 100% survival
**Challenges**: Wrapper ordering, dependency issues, sandbox resets

### 4.2 Phase 2: Exploration Shaping (COMPLETE)

**Duration**: 500,408 additional steps (600,352 → 1,001,760)
**Mechanism**: Visit-count bonus = 1/(1 + visits), annealed adaptively via tanh(avg_enemy_deaths) (sketched below)
**Hardware**: A10G, ~50 FPS
**Wall time**: ~2h 45min
**Result**: Win rate 93.0%, avg reward 153.4, avg bombs 20.1
**Key insight**: Reward decreased (180 → 153) while win rate increased (92% → 93%), confirming that exploration makes the policy more robust at the cost of safe base-camping reward.
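
A sketch of the shaping term (the annealing direction, scaling the bonus down as combat improves, is an assumption; check the Phase 2 script for the exact form):

```python
import math
from collections import defaultdict

class ExplorationBonus:
    """Per-episode visit-count bonus with adaptive annealing."""

    def __init__(self):
        self.visits = defaultdict(int)  # grid cell -> visit count, reset each episode

    def bonus(self, cell: tuple, avg_enemy_deaths: float) -> float:
        self.visits[cell] += 1
        novelty = 1.0 / (1.0 + self.visits[cell])      # 1 / (1 + visits)
        scale = 1.0 - math.tanh(avg_enemy_deaths)      # assumed: anneal as kills rise
        return novelty * scale
```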

### 4.3 Phase 3: Curriculum Self-Play (PENDING)

**Script**: `phase3_curriculum.py` (ready on Hub)
**Plan**: 5-stage rule-based curriculum — static → random → simple_bomb → evasive → mixed
**Duration**: 1M steps
**Advancement gate**: >55% win rate per stage
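
The advancement gate, sketched (stage names from the plan above; the rolling-window bookkeeping is illustrative):

```python
# Advance to the next opponent stage once the rolling win rate clears 55%.
STAGES = ["static", "random", "simple_bomb", "evasive", "mixed"]
WIN_GATE = 0.55

def advance_stage(stage_idx: int, recent_wins: list) -> int:
    win_rate = sum(recent_wins) / max(len(recent_wins), 1)
    if win_rate > WIN_GATE and stage_idx < len(STAGES) - 1:
        return stage_idx + 1
    return stage_idx
```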

---

## 5. Results

### 5.1 Phase 1 Results

| Metric | Value |
|---|---|
| Timesteps | 500,352 |
| Final Reward | 237.0 |
| FPS | 52 (A10G) |
| Wall time | ~2h 15min |
| Win Rate (eval) | **92.0%** |
| Avg Reward (eval) | **180.1** |
| Survival Rate | **100.0%** |

### 5.2 Phase 2 Results

| Metric | Value |
|---|---|
| Timesteps | 1,001,760 total (500,408 new) |
| FPS | 50 (A10G) |
| Wall time | ~2h 45min |
| Win Rate (eval) | **93.0%** |
| Avg Reward (eval) | **153.4** |
| Avg Bombs | **20.1** |

---

## 6. Artifacts

| File | Purpose |
|---|---|
| `phase1_final.zip` | Phase 1 complete checkpoint |
| `phase2_final.zip` | Phase 2 complete checkpoint |
| `phase2_ckpt_*.zip` | Phase 2 intermediates (650k–1M) |
| `phase2_eval_results.txt` | Phase 2 evaluation metrics |
| `ae_manager.py` | Inference code |
| `docs/ae.md` | This documentation |

---

## 7. Next Steps

- [ ] Submit Phase 3 HF Job (`phase3_curriculum.py`)
- [ ] Monitor 5-stage curriculum progression
- [ ] Evaluate final model vs mixed rule-based opponents
- [ ] Future: CNN policy, opponent modeling, LSTM memory

*Last updated: 2026-05-14 — Phase 2 complete, Phase 3 ready*