# TIL-26-AE: Automated Exploration Bomberman Agent
**Repository**: `E-Rong/til-26-ae-agent`
**Challenge**: The Intelligent League (TIL) — Automated Exploration (AE)
**Base Environment**: `e-rong/til-26-ae` Space
**Model Repo**: `E-Rong/til-26-ae-agent` (checkpoints + inference code)
---
## Table of Contents
1. [Research & Literature Review](#1-research--literature-review)
2. [Problem Analysis](#2-problem-analysis)
3. [Development Decisions](#3-development-decisions)
4. [Training Phases](#4-training-phases)
5. [Results](#5-results)
6. [Artifacts](#6-artifacts)
7. [Next Steps](#7-next-steps)
---
## 1. Research & Literature Review
### 1.1 Domain: Multi-Agent Bomberman RL
The TIL-26-AE challenge is a multi-agent Bomberman-like environment where agents navigate a grid, collect resources, place bombs, destroy walls, and eliminate opponents. The key challenge is **autonomous exploration**.
### 1.2 Key Papers
| Paper | arXiv ID | Key Insight | Relevance |
|---|---|---|---|
| *Pommerman: A Multi-Agent Benchmark* | 2407.00662 | PettingZoo + parallel env standard | Confirmed approach |
| *MAPPO* | 2103.01955 | Shared parameters, curriculum | Justified curriculum |
| *Invalid Action Masking* | 2006.14171 | Masks logits before softmax | **Directly applicable** |
| *PPO Algorithms* | 1707.06347 | Clipped surrogate, stable | Chosen over DQN |
### 1.3 Why MaskablePPO?
Bomberman agents cannot move into walls, step out of bounds, or place bombs with an empty stockpile. The observation includes `action_mask: uint8[6]`. Standard PPO would waste roughly 30-40% of samples on illegal moves; MaskablePPO masks the logits before the softmax, ensuring only legal actions are sampled.
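A minimal NumPy sketch of the masking idea (MaskablePPO applies this to the policy logits internally; the values here are illustrative):

```python
import numpy as np

def masked_softmax(logits: np.ndarray, action_mask: np.ndarray) -> np.ndarray:
    """Drive illegal-action logits to -inf so softmax assigns them zero probability."""
    masked = np.where(action_mask.astype(bool), logits, -np.inf)
    exp = np.exp(masked - masked.max())  # subtract max for numerical stability
    return exp / exp.sum()

logits = np.array([1.2, 0.3, -0.5, 2.0, 0.0, 1.1])   # 6 actions
mask = np.array([1, 1, 0, 1, 1, 0], dtype=np.uint8)  # two illegal actions masked out
probs = masked_softmax(logits, mask)
assert probs[2] == 0.0 and probs[5] == 0.0  # illegal actions are never sampled
```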
### 1.4 Why Curriculum Learning?
Training against strong opponents from scratch leads to catastrophic early losses (~0 reward). Curriculum learning (easy → hard) is standard in competitive multi-agent RL.
### 1.5 Why Not DQN?
DQN struggles with action masking (it requires a custom architecture). PPO's on-policy updates better handle the non-stationarity of multi-agent self-play, and PPO has mature masking support in `sb3-contrib`.
---
## 2. Problem Analysis
### 2.1 Environment Structure
- **Grid size**: 16×16
- **Agents**: Configurable (default 2 teams, Phase 3 uses 3)
- **Observations**: Dict with `agent_viewcone[7×5×25]`, `base_viewcone[5×5×25]`, direction, location, health, `action_mask[6]`, etc.
- **Actions**: Discrete(6) — FORWARD, BACKWARD, LEFT, RIGHT, STAY, PLACE_BOMB
- **Episode length**: ~200 steps
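A sketch of these spaces in `gymnasium` terms. Only the shapes and counts above come from the environment spec; the bounds, dtypes, and exact scalar keys are assumptions:

```python
import numpy as np
from gymnasium import spaces

# Shapes follow the spec above; bounds and dtypes are illustrative assumptions.
observation_space = spaces.Dict({
    "agent_viewcone": spaces.Box(0, 255, shape=(7, 5, 25), dtype=np.uint8),
    "base_viewcone":  spaces.Box(0, 255, shape=(5, 5, 25), dtype=np.uint8),
    "direction":      spaces.Discrete(4),
    "location":       spaces.Box(0, 15, shape=(2,), dtype=np.uint8),  # 16x16 grid
    "health":         spaces.Box(0, 100, shape=(1,), dtype=np.uint8),
    "action_mask":    spaces.MultiBinary(6),
})
action_space = spaces.Discrete(6)  # FORWARD, BACKWARD, LEFT, RIGHT, STAY, PLACE_BOMB
```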
### 2.2 Observation Flattening
The Dict observation is flattened to a **1511-dim vector**: `agent_viewcone` (7×5×25 = 875) + `base_viewcone` (5×5×25 = 625) + 11 scalar features.
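A minimal flattening sketch (875 + 625 + 11 = 1511). The `scalars` key is a stand-in for the 11 scalar features (direction, location, health, etc.); their exact packing is project-specific:

```python
import numpy as np

def flatten_obs(obs: dict) -> np.ndarray:
    """Concatenate viewcones and scalar features into one float32 vector."""
    vec = np.concatenate([
        np.asarray(obs["agent_viewcone"], dtype=np.float32).ravel(),  # 7*5*25 = 875
        np.asarray(obs["base_viewcone"],  dtype=np.float32).ravel(),  # 5*5*25 = 625
        np.asarray(obs["scalars"],        dtype=np.float32).ravel(),  # 11 scalars
    ])
    assert vec.shape == (1511,)
    return vec
```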
### 2.3 Action Masking
Critical bug found: `Monitor` must wrap *outside* `ActionMasker`, not inside. Otherwise `get_action_masks()` fails because `Monitor` does not expose `action_masks()`.
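The fixed ordering, sketched with the `sb3-contrib` wrapper. Here `base_env` is the single-agent wrapper from §3.1, and the `mask_fn` body uses a hypothetical accessor; in this project the mask ultimately comes from the observation's `action_mask` field:

```python
import numpy as np
from stable_baselines3.common.monitor import Monitor
from sb3_contrib.common.wrappers import ActionMasker

def mask_fn(env) -> np.ndarray:
    # Hypothetical accessor: the base env caches the last observed action_mask.
    return np.asarray(env.last_action_mask, dtype=bool)

env = Monitor(ActionMasker(base_env, mask_fn))   # correct: Monitor wraps outside
# env = ActionMasker(Monitor(base_env), mask_fn) # wrong: get_action_masks() breaks
```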
---
## 3. Development Decisions
### 3.1 Single-Agent Wrapper
The wrapper controls only `agent_0`; opponents follow random (Phases 1-2) or rule-based (Phase 3) policies. This reduces the problem to single-agent RL in a non-stationary environment.
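A sketch of the reduction, assuming a PettingZoo parallel env and a scripted `opponent_policy` (the names here are illustrative, not the repo's actual classes):

```python
import gymnasium as gym

class SingleAgentWrapper(gym.Env):
    """Drive agent_0 with the learned policy; script everyone else."""

    def __init__(self, parallel_env, opponent_policy):
        self.penv = parallel_env
        self.opponent_policy = opponent_policy
        self.observation_space = parallel_env.observation_space("agent_0")
        self.action_space = parallel_env.action_space("agent_0")

    def reset(self, *, seed=None, options=None):
        self._obs, infos = self.penv.reset(seed=seed)
        return self._obs["agent_0"], infos.get("agent_0", {})

    def step(self, action):
        # Learned action for agent_0, scripted actions for all other agents.
        acts = {a: self.opponent_policy(self._obs[a])
                for a in self.penv.agents if a != "agent_0"}
        acts["agent_0"] = action
        self._obs, rew, term, trunc, infos = self.penv.step(acts)
        return (self._obs["agent_0"], rew["agent_0"],
                term["agent_0"], trunc["agent_0"], infos.get("agent_0", {}))
```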
### 3.2 3-Phase Curriculum
| Phase | Opponent | Duration | Purpose |
|---|---|---|---|
| **1** | Random | 500k | Learn movement, bombs, basics |
| **2** | Random + exploration bonus | 500k | Prevent camping exploit |
| **3** | Rule-based curriculum | 1M | Generalize to structured opponents |
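The same schedule as a config sketch (the phase and opponent names are illustrative labels, not the repo's actual identifiers):

```python
# (name, opponent, timesteps)
PHASES = [
    ("phase1_foundation",  "random",                  500_000),
    ("phase2_exploration", "random+explore_bonus",    500_000),
    ("phase3_curriculum",  "rule_based_curriculum", 1_000_000),
]
```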
### 3.3 Philosophy
- `stable-baselines3` for PPO core
- `sb3-contrib` for MaskablePPO + ActionMasker
- `huggingface_hub` for persistent checkpoint storage
### 3.4 Why Hub Every 50k Steps
Sandbox resets (T4 container recycling) wiped the local `/app/data/` directory multiple times. Hub checkpointing saved the project at the 400k-step mark when training crashed.
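A sketch of such a callback with `stable-baselines3` and `huggingface_hub` (the paths are illustrative, and this is not necessarily the repo's exact callback):

```python
from huggingface_hub import upload_file
from stable_baselines3.common.callbacks import BaseCallback

class HubCheckpointCallback(BaseCallback):
    """Save and push a checkpoint every `save_freq` timesteps."""

    def __init__(self, save_freq: int = 50_000,
                 repo_id: str = "E-Rong/til-26-ae-agent"):
        super().__init__()
        self.save_freq = save_freq
        self.repo_id = repo_id
        self._last_save = 0

    def _on_step(self) -> bool:
        if self.num_timesteps - self._last_save >= self.save_freq:
            self._last_save = self.num_timesteps
            path = f"/tmp/ckpt_{self.num_timesteps}.zip"
            self.model.save(path)  # Hub copy survives even if /app/data/ is wiped
            upload_file(path_or_fileobj=path,
                        path_in_repo=f"ckpt_{self.num_timesteps}.zip",
                        repo_id=self.repo_id)
        return True
```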
---
## 4. Training Phases
### 4.1 Phase 1: Foundation (vs Random)
**Duration**: 500,352 steps
**Result**: Win rate 92%, avg reward 180.1, 100% survival
**Challenges**: Wrapper ordering, dependency issues, sandbox resets
### 4.2 Phase 2: Exploration Shaping (IN PROGRESS)
**Status**: Started at 500,352 steps, running on an A10G at ~54 FPS
**Mechanism**: Visit-count bonus = 1/(1 + visits), adaptively annealed via tanh(avg_enemy_deaths); see the sketch below
**ETA**: ~2.5 hours, targets 1,000,352 total steps
**Purpose**: Force map exploration, prevent safe base-camping
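A sketch of the shaping term with the stated `k = 1.2`. The annealing direction (fading the bonus as combat performance rises) is an assumption consistent with the stated purpose:

```python
import math
from collections import defaultdict

class ExplorationBonus:
    """Visit-count shaping: bonus = k * anneal / (1 + visits[cell])."""

    def __init__(self, k: float = 1.2):  # k = 1.2 per the Phase 2 config
        self.k = k
        self.visits = defaultdict(int)

    def bonus(self, cell: tuple, avg_enemy_deaths: float) -> float:
        self.visits[cell] += 1
        anneal = 1.0 - math.tanh(avg_enemy_deaths)  # fade as the agent learns to fight
        return self.k * anneal / (1.0 + self.visits[cell])
```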
### 4.3 Phase 3: Curriculum Self-Play
**Pending**: Rule-based static → simple → smart → mixed, 3 teams, 1M steps
---
## 5. Results
### 5.1 Phase 1 Results
| Metric | Value |
|---|---|
| Timesteps | 500,352 |
| Final Reward | 237.0 |
| FPS | 52 (A10G) |
| Wall time | ~2h 15min |
| Win Rate (eval) | **92.0%** |
| Avg Reward (eval) | **180.1** |
| Survival Rate | **100.0%** |
### 5.2 Phase 2 Interim (Early)
| Metric | Value |
|---|---|
| Starting Step | 500,352 |
| Initial Reward (shaped) | 210 |
| FPS | 54 |
| Explore Weight | Adaptive k=1.2 |
---
## 6. Artifacts
| File | Purpose |
|---|---|
| `phase1_final.zip` | Trained model |
| `phase2_final.zip` | *(in progress)* |
| `ckpt_50000-400000.zip` | Phase 1 intermediates |
| `ae_manager.py` | Inference code |
| `docs/ae.md` | This documentation |
---
## 7. Next Steps
- **Phase 2**: Complete 500k exploration-shaping steps
- **Phase 3**: Curriculum vs rule-based opponents (1M steps)
- **Eval**: Multi-team evaluation vs smart opponents
- **Future**: CNN policy, opponent modeling, LSTM memory
*Last updated: 2026-05-14 — Phase 2 in progress*