TIL-26-AE: Automated Exploration Bomberman Agent

  • Repository: E-Rong/til-26-ae-agent
  • Challenge: The Intelligent League (TIL) — Automated Exploration (AE)
  • Base Environment: e-rong/til-26-ae Space
  • Model Repo: E-Rong/til-26-ae-agent (checkpoints + inference code)


Table of Contents

  1. Research & Literature Review
  2. Problem Analysis
  3. Development Decisions
  4. Training Phases
  5. Results
  6. Artifacts
  7. Next Steps

1. Research & Literature Review

1.1 Domain: Multi-Agent Bomberman RL

The TIL-26-AE challenge is a multi-agent Bomberman-like environment where agents navigate a grid, collect resources, place bombs, destroy walls, and eliminate opponents. The key challenge is autonomous exploration.

1.2 Key Papers

| Paper | arXiv ID | Key Insight | Relevance |
| --- | --- | --- | --- |
| Pommerman: A Multi-Agent Benchmark | 2407.00662 | PettingZoo + parallel env standard | Confirmed approach |
| MAPPO | 2103.01955 | Shared parameters, curriculum | Justified curriculum |
| Invalid Action Masking | 2006.14171 | Masks logits before softmax | Directly applicable |
| PPO Algorithms | 1707.06347 | Clipped surrogate, stable | Chosen over DQN |

1.3 Why MaskablePPO?

Bomberman agents cannot move into walls, step out of bounds, or place bombs without a stockpile. The observation includes action_mask: uint8[6]. Standard PPO would waste ~30-40% of its samples on illegal moves; MaskablePPO masks the logits before the softmax, so only legal actions are ever sampled.
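
A minimal sketch of the idea (the concept, not sb3-contrib's actual internals): logits of illegal actions are forced to -inf, so the softmax assigns them exactly zero probability and they can never be sampled.

```python
import numpy as np

def masked_softmax(logits: np.ndarray, action_mask: np.ndarray) -> np.ndarray:
    # Force illegal logits to -inf; the softmax then gives them probability 0.
    masked = np.where(action_mask.astype(bool), logits, -np.inf)
    exp = np.exp(masked - masked.max())  # subtract max for numerical stability
    return exp / exp.sum()

# Example: PLACE_BOMB (index 5) is illegal when the stockpile is empty.
logits = np.array([0.2, 1.0, -0.3, 0.5, 0.0, 2.0])
mask = np.array([1, 1, 1, 1, 1, 0], dtype=np.uint8)  # action_mask: uint8[6]
probs = masked_softmax(logits, mask)  # probs[5] == 0.0, rest renormalized
```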

1.4 Why Curriculum Learning?

Training against strong opponents from scratch leads to catastrophic early losses (~0 reward). Curriculum learning (easy → hard) is standard in competitive multi-agent RL.

1.5 Why Not DQN?

DQN struggles with action masking (it requires a custom architecture). PPO's on-policy updates handle the non-stationarity of multi-agent self-play better, and PPO has mature masking support in sb3-contrib.


2. Problem Analysis

2.1 Environment Structure

  • Grid size: 16×16
  • Agents: Configurable (default 2 teams, Phase 3 uses 3)
  • Observations: Dict with agent_viewcone[7×5×25], base_viewcone[5×5×25], direction, location, health, action_mask[6], etc. (see the space sketch after this list)
  • Actions: Discrete(6) — FORWARD, BACKWARD, LEFT, RIGHT, STAY, PLACE_BOMB
  • Episode length: ~200 steps
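
In Gymnasium terms the spaces look roughly like the sketch below; the shapes come from the list above, while the exact dtypes and value bounds are assumptions.

```python
import numpy as np
from gymnasium import spaces

# Shapes per the list above; dtypes and bounds are assumptions.
observation_space = spaces.Dict({
    "agent_viewcone": spaces.Box(0, 255, shape=(7, 5, 25), dtype=np.uint8),
    "base_viewcone": spaces.Box(0, 255, shape=(5, 5, 25), dtype=np.uint8),
    "direction": spaces.Discrete(4),
    "location": spaces.Box(0, 15, shape=(2,), dtype=np.int64),  # 16x16 grid
    "health": spaces.Box(0, 100, shape=(1,), dtype=np.int64),
    "action_mask": spaces.Box(0, 1, shape=(6,), dtype=np.uint8),  # uint8[6]
})
action_space = spaces.Discrete(6)  # FORWARD, BACKWARD, LEFT, RIGHT, STAY, PLACE_BOMB
```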

2.2 Observation Flattening

Each Dict observation is flattened to a 1511-dim vector: agent_viewcone (875) + base_viewcone (625) + 11 scalars.
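
A sketch of that flattening; the three named scalar fields are the ones documented above, and the remaining scalars are elided because the full field list is not spelled out here.

```python
import numpy as np

def flatten_obs(obs: dict) -> np.ndarray:
    parts = [
        obs["agent_viewcone"].astype(np.float32).ravel(),       # 7*5*25 = 875
        obs["base_viewcone"].astype(np.float32).ravel(),        # 5*5*25 = 625
        np.atleast_1d(obs["direction"]).astype(np.float32),     # 1 scalar
        np.asarray(obs["location"], dtype=np.float32).ravel(),  # 2 scalars
        np.atleast_1d(obs["health"]).astype(np.float32),        # 1 scalar
        # ... further scalar fields (omitted) bring the scalar total to 11
    ]
    return np.concatenate(parts)  # 875 + 625 + 11 = 1511 dims with all scalars
```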

2.3 Action Masking

Critical bug found: Monitor must wrap outside ActionMasker, i.e. Monitor(ActionMasker(env)), not inside it. With the wrong ordering, get_action_masks() fails because Monitor does not expose action_masks().
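
A sketch of the fix; ActionMasker and Monitor are the real sb3-contrib / stable-baselines3 classes, while mask_fn and the last_action_mask attribute it reads are assumptions about how the base env exposes its mask.

```python
from sb3_contrib.common.wrappers import ActionMasker
from stable_baselines3.common.monitor import Monitor

def mask_fn(env):
    # Hypothetical accessor: assumes the base env stores the action_mask
    # from its latest observation; the real attribute name may differ.
    return env.unwrapped.last_action_mask

def make_env(base_env):
    # Correct order: ActionMasker inside, Monitor outermost.
    return Monitor(ActionMasker(base_env, mask_fn))
    # Broken order: ActionMasker(Monitor(base_env), mask_fn) would hand
    # mask_fn the Monitor, which hides the mask attribute of the base env.
```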


3. Development Decisions

3.1 Single-Agent Wrapper

The wrapper controls only agent_0; opponents follow random (Phases 1-2) or rule-based (Phase 3) policies. This reduces the problem to single-agent RL in a non-stationary environment.
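
A minimal sketch of such a wrapper over a PettingZoo parallel env; the parallel API calls are standard, while the opponent_policy(agent_id, obs) signature is an assumption.

```python
import gymnasium as gym

class SingleAgentWrapper(gym.Env):
    """Expose agent_0 as a single-agent Gymnasium env; every other agent
    acts through opponent_policy (random in Phases 1-2, rule-based in 3)."""

    def __init__(self, parallel_env, opponent_policy):
        self.env = parallel_env
        self.opponent_policy = opponent_policy
        self.observation_space = parallel_env.observation_space("agent_0")
        self.action_space = parallel_env.action_space("agent_0")
        self._last_obs = None

    def reset(self, *, seed=None, options=None):
        obs, infos = self.env.reset(seed=seed)
        self._last_obs = obs
        return obs["agent_0"], infos.get("agent_0", {})

    def step(self, action):
        # Learner controls agent_0; opponents act on their own observations.
        actions = {
            agent: self.opponent_policy(agent, self._last_obs[agent])
            for agent in self.env.agents if agent != "agent_0"
        }
        actions["agent_0"] = action
        obs, rewards, terms, truncs, infos = self.env.step(actions)
        self._last_obs = obs
        return (obs["agent_0"], rewards["agent_0"], terms["agent_0"],
                truncs["agent_0"], infos.get("agent_0", {}))
```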

3.2 3-Phase Curriculum

| Phase | Opponent | Duration | Purpose |
| --- | --- | --- | --- |
| 1 | Random | 500k | Learn movement, bombs, basics |
| 2 | Random + exploration bonus | 500k | Prevent camping exploit |
| 3 | Rule-based curriculum | 1M | Generalize to structured opponents |
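
One way to encode this schedule as data (step budgets come from the table; the opponent tags are illustrative names, not identifiers from the codebase):

```python
CURRICULUM = [
    {"phase": 1, "opponent": "random", "steps": 500_000},
    {"phase": 2, "opponent": "random", "steps": 500_000, "explore_bonus": True},
    {"phase": 3, "opponent": "rule_based", "steps": 1_000_000},  # static -> simple -> smart -> mixed
]
```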

3.3 Philosophy

  • stable-baselines3 for PPO core
  • sb3-contrib for MaskablePPO + ActionMasker
  • huggingface_hub for persistent checkpoint storage

3.4 Why Checkpoint to the Hub Every 50k Steps

Sandbox resets (T4 container recycling) wiped local /app/data/ storage multiple times. Hub checkpointing saved the project at 400k steps when training crashed.
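
A sketch of that checkpoint loop as an SB3 callback; the repo id is the one named above, while the local path and filename pattern are assumptions.

```python
from huggingface_hub import HfApi
from stable_baselines3.common.callbacks import BaseCallback

class HubCheckpointCallback(BaseCallback):
    """Push a checkpoint to the Hub every save_freq steps so a sandbox
    reset cannot wipe training progress."""

    def __init__(self, save_freq: int = 50_000, repo_id: str = "E-Rong/til-26-ae-agent"):
        super().__init__()
        self.save_freq = save_freq
        self.repo_id = repo_id
        self.api = HfApi()
        self._next_save = save_freq

    def _on_step(self) -> bool:
        # >= comparison so vectorized envs cannot step over the threshold.
        if self.num_timesteps >= self._next_save:
            path = f"/app/data/ckpt_{self.num_timesteps}.zip"  # assumed local path
            self.model.save(path)
            self.api.upload_file(
                path_or_fileobj=path,
                path_in_repo=f"ckpt_{self.num_timesteps}.zip",
                repo_id=self.repo_id,
            )
            self._next_save += self.save_freq
        return True
```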


4. Training Phases

4.1 Phase 1: Foundation (vs Random)

  • Duration: 500,352 steps
  • Result: 92% win rate, 180.1 avg reward, 100% survival
  • Challenges: wrapper ordering, dependency issues, sandbox resets

4.2 Phase 2: Exploration Shaping (IN PROGRESS)

  • Status: started at 500,352 steps, running on an A10G at ~54 FPS
  • Mechanism: visit-count bonus = 1/(1+visits), adaptive annealing via tanh(avg_enemy_deaths) (sketched below)
  • ETA: ~2.5 hours, targeting 1,000,352 total steps
  • Purpose: force map exploration, prevent safe base-camping
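
A sketch of that shaping term; the 1/(1+visits) bonus and the tanh signal are from the description above, k=1.2 is taken from the interim results table, and the exact way the two factors combine is an assumption.

```python
import math
from collections import defaultdict

class ExplorationBonus:
    """Visit-count shaping with adaptive annealing (assumed combination:
    the bonus fades as the agent starts winning fights)."""

    def __init__(self, k: float = 1.2):
        self.k = k
        self.visits = defaultdict(int)

    def reset(self):
        self.visits.clear()  # per-episode visit counts

    def bonus(self, cell: tuple, avg_enemy_deaths: float) -> float:
        self.visits[cell] += 1
        novelty = 1.0 / (1.0 + self.visits[cell])   # visit-count bonus
        anneal = 1.0 - math.tanh(avg_enemy_deaths)  # adaptive annealing
        return self.k * novelty * anneal            # k=1.2 per interim table
```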

4.3 Phase 3: Curriculum Self-Play

Pending: rule-based opponent curriculum (static → simple → smart → mixed), 3 teams, 1M steps


5. Results

5.1 Phase 1 Results

| Metric | Value |
| --- | --- |
| Timesteps | 500,352 |
| Final Reward | 237.0 |
| FPS | 52 (A10G) |
| Wall time | ~2h 15min |
| Win Rate (eval) | 92.0% |
| Avg Reward (eval) | 180.1 |
| Survival Rate | 100.0% |

5.2 Phase 2 Interim (Early)

| Metric | Value |
| --- | --- |
| Starting Step | 500,352 |
| Initial Reward (shaped) | 210 |
| FPS | 54 |
| Explore Weight | Adaptive (k=1.2) |

6. Artifacts

| File | Purpose |
| --- | --- |
| phase1_final.zip | Trained model |
| phase2_final.zip | (in progress) |
| ckpt_50000-400000.zip | Phase 1 intermediates |
| ae_manager.py | Inference code |
| docs/ae.md | This documentation |

7. Next Steps

  • Phase 2: Complete 500k exploration-shaping steps
  • Phase 3: Curriculum vs rule-based opponents (1M steps)
  • Eval: Multi-team evaluation vs smart opponents
  • Future: CNN policy, opponent modeling, LSTM memory

Last updated: 2026-05-14 — Phase 2 in progress