TIL-26-AE: Automated Exploration Bomberman Agent

  • Repository: E-Rong/til-26-ae-agent
  • Challenge: The Intelligent League (TIL) — Automated Exploration (AE)
  • Base Environment: e-rong/til-26-ae Space
  • Model Repo: E-Rong/til-26-ae-agent (checkpoints + inference code)


Table of Contents

  1. Research & Literature Review
  2. Problem Analysis
  3. Development Decisions
  4. Training Phases
  5. Results
  6. Artifacts
  7. Next Steps

1. Research & Literature Review

1.1 Domain: Multi-Agent Bomberman RL

The TIL-26-AE challenge is a multi-agent Bomberman-like environment where agents navigate a grid, collect resources, place bombs, destroy walls, and eliminate opponents. The key challenge is autonomous exploration.

1.2 Key Papers

| Paper | arXiv ID | Key Insight | Relevance |
| --- | --- | --- | --- |
| Pommerman: A Multi-Agent Benchmark | 2407.00662 | PettingZoo + parallel env standard | Confirmed approach |
| MAPPO | 2103.01955 | Shared parameters, curriculum | Justified curriculum |
| Invalid Action Masking | 2006.14171 | Masks logits before softmax | Directly applicable |
| PPO Algorithms | 1707.06347 | Clipped surrogate, stable | Chosen over DQN |

1.3 Why MaskablePPO?

Bomberman agents cannot move into walls, step out of bounds, or place bombs without a stockpile. The observation includes action_mask: uint8[6]. Standard PPO would waste ~30-40% of its samples on illegal moves; MaskablePPO masks the logits before the softmax, so only legal actions are ever sampled.
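
A minimal sketch of the idea (the concept, not sb3-contrib's actual internals): logits of illegal actions are forced to -inf, so the softmax assigns them exactly zero probability and they can never be sampled.

```python
import numpy as np

def masked_softmax(logits: np.ndarray, action_mask: np.ndarray) -> np.ndarray:
    # Force illegal logits to -inf; the softmax then gives them probability 0.
    masked = np.where(action_mask.astype(bool), logits, -np.inf)
    exp = np.exp(masked - masked.max())  # subtract max for numerical stability
    return exp / exp.sum()

# Example: PLACE_BOMB (index 5) is illegal when the stockpile is empty.
logits = np.array([0.2, 1.0, -0.3, 0.5, 0.0, 2.0])
mask = np.array([1, 1, 1, 1, 1, 0], dtype=np.uint8)  # action_mask: uint8[6]
probs = masked_softmax(logits, mask)  # probs[5] == 0.0, rest renormalized
```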

1.4 Why Curriculum Learning?

Training against strong opponents from scratch leads to catastrophic early losses (~0 reward). Curriculum learning (easy → hard) is standard in competitive multi-agent RL.

1.5 Why Not DQN?

DQN struggles with action masking (it requires a custom architecture). PPO's on-policy updates handle the non-stationarity of multi-agent self-play better, and PPO has mature masking support in sb3-contrib.


2. Problem Analysis

2.1 Environment Structure

  • Grid size: 16×16
  • Agents: Configurable (default 2 teams, Phase 3 uses 3)
  • Observations: Dict with agent_viewcone[7×5×25], base_viewcone[5×5×25], direction, location, health, action_mask[6], etc. (see the space sketch after this list)
  • Actions: Discrete(6) — FORWARD, BACKWARD, LEFT, RIGHT, STAY, PLACE_BOMB
  • Episode length: ~200 steps
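
In Gymnasium terms the spaces look roughly like the sketch below; the shapes come from the list above, while the exact dtypes and value bounds are assumptions.

```python
import numpy as np
from gymnasium import spaces

# Shapes per the list above; dtypes and bounds are assumptions.
observation_space = spaces.Dict({
    "agent_viewcone": spaces.Box(0, 255, shape=(7, 5, 25), dtype=np.uint8),
    "base_viewcone": spaces.Box(0, 255, shape=(5, 5, 25), dtype=np.uint8),
    "direction": spaces.Discrete(4),
    "location": spaces.Box(0, 15, shape=(2,), dtype=np.int64),  # 16x16 grid
    "health": spaces.Box(0, 100, shape=(1,), dtype=np.int64),
    "action_mask": spaces.Box(0, 1, shape=(6,), dtype=np.uint8),  # uint8[6]
})
action_space = spaces.Discrete(6)  # FORWARD, BACKWARD, LEFT, RIGHT, STAY, PLACE_BOMB
```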

2.2 Observation Flattening

Each Dict observation is flattened to a 1511-dim vector: agent_viewcone (875) + base_viewcone (625) + 11 scalars.
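
A sketch of that flattening; the three named scalar fields are the ones documented above, and the remaining scalars are elided because the full field list is not spelled out here.

```python
import numpy as np

def flatten_obs(obs: dict) -> np.ndarray:
    parts = [
        obs["agent_viewcone"].astype(np.float32).ravel(),       # 7*5*25 = 875
        obs["base_viewcone"].astype(np.float32).ravel(),        # 5*5*25 = 625
        np.atleast_1d(obs["direction"]).astype(np.float32),     # 1 scalar
        np.asarray(obs["location"], dtype=np.float32).ravel(),  # 2 scalars
        np.atleast_1d(obs["health"]).astype(np.float32),        # 1 scalar
        # ... further scalar fields (omitted) bring the scalar total to 11
    ]
    return np.concatenate(parts)  # 875 + 625 + 11 = 1511 dims with all scalars
```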

2.3 Action Masking

Critical bug found: Monitor must wrap outside ActionMasker, i.e. Monitor(ActionMasker(env)), not inside it. With the wrong ordering, get_action_masks() fails because Monitor does not expose action_masks().
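
A sketch of the fix; ActionMasker and Monitor are the real sb3-contrib / stable-baselines3 classes, while mask_fn and the last_action_mask attribute it reads are assumptions about how the base env exposes its mask.

```python
from sb3_contrib.common.wrappers import ActionMasker
from stable_baselines3.common.monitor import Monitor

def mask_fn(env):
    # Hypothetical accessor: assumes the base env stores the action_mask
    # from its latest observation; the real attribute name may differ.
    return env.unwrapped.last_action_mask

def make_env(base_env):
    # Correct order: ActionMasker inside, Monitor outermost.
    return Monitor(ActionMasker(base_env, mask_fn))
    # Broken order: ActionMasker(Monitor(base_env), mask_fn) would hand
    # mask_fn the Monitor, which hides the mask attribute of the base env.
```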


3. Development Decisions

3.1 Single-Agent Wrapper

The wrapper controls only agent_0; opponents follow random (Phases 1-2) or rule-based (Phase 3) policies. This reduces the problem to single-agent RL in a non-stationary environment.
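
A minimal sketch of such a wrapper over a PettingZoo parallel env; the parallel API calls are standard, while the opponent_policy(agent_id, obs) signature is an assumption.

```python
import gymnasium as gym

class SingleAgentWrapper(gym.Env):
    """Expose agent_0 as a single-agent Gymnasium env; every other agent
    acts through opponent_policy (random in Phases 1-2, rule-based in 3)."""

    def __init__(self, parallel_env, opponent_policy):
        self.env = parallel_env
        self.opponent_policy = opponent_policy
        self.observation_space = parallel_env.observation_space("agent_0")
        self.action_space = parallel_env.action_space("agent_0")
        self._last_obs = None

    def reset(self, *, seed=None, options=None):
        obs, infos = self.env.reset(seed=seed)
        self._last_obs = obs
        return obs["agent_0"], infos.get("agent_0", {})

    def step(self, action):
        # Learner controls agent_0; opponents act on their own observations.
        actions = {
            agent: self.opponent_policy(agent, self._last_obs[agent])
            for agent in self.env.agents if agent != "agent_0"
        }
        actions["agent_0"] = action
        obs, rewards, terms, truncs, infos = self.env.step(actions)
        self._last_obs = obs
        return (obs["agent_0"], rewards["agent_0"], terms["agent_0"],
                truncs["agent_0"], infos.get("agent_0", {}))
```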

3.2 3-Phase Curriculum

| Phase | Opponent | Duration | Purpose |
| --- | --- | --- | --- |
| 1 | Random | 500k | Learn movement, bombs, basics |
| 2 | Random + exploration bonus | 500k | Prevent camping exploit |
| 3 | Rule-based curriculum | 1M | Generalize to structured opponents |
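
One way to encode this schedule as data (step budgets come from the table; the opponent tags are illustrative names, not identifiers from the codebase):

```python
CURRICULUM = [
    {"phase": 1, "opponent": "random", "steps": 500_000},
    {"phase": 2, "opponent": "random", "steps": 500_000, "explore_bonus": True},
    {"phase": 3, "opponent": "rule_based", "steps": 1_000_000},  # static -> simple -> smart -> mixed
]
```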

3.3 Philosophy

  • stable-baselines3 for PPO core
  • sb3-contrib for MaskablePPO + ActionMasker
  • huggingface_hub for persistent checkpoint storage

3.4 Why Checkpoint to the Hub Every 50k Steps

Sandbox resets (T4 container recycling) wiped local /app/data/ storage multiple times. Hub checkpointing saved the project at 400k steps when training crashed.
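
A sketch of that checkpoint loop as an SB3 callback; the repo id is the one named above, while the local path and filename pattern are assumptions.

```python
from huggingface_hub import HfApi
from stable_baselines3.common.callbacks import BaseCallback

class HubCheckpointCallback(BaseCallback):
    """Push a checkpoint to the Hub every save_freq steps so a sandbox
    reset cannot wipe training progress."""

    def __init__(self, save_freq: int = 50_000, repo_id: str = "E-Rong/til-26-ae-agent"):
        super().__init__()
        self.save_freq = save_freq
        self.repo_id = repo_id
        self.api = HfApi()
        self._next_save = save_freq

    def _on_step(self) -> bool:
        # >= comparison so vectorized envs cannot step over the threshold.
        if self.num_timesteps >= self._next_save:
            path = f"/app/data/ckpt_{self.num_timesteps}.zip"  # assumed local path
            self.model.save(path)
            self.api.upload_file(
                path_or_fileobj=path,
                path_in_repo=f"ckpt_{self.num_timesteps}.zip",
                repo_id=self.repo_id,
            )
            self._next_save += self.save_freq
        return True
```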


4. Training Phases

4.1 Phase 1: Foundation (vs Random)

  • Duration: 500,352 steps
  • Result: 92% win rate, 180.1 avg reward, 100% survival
  • Challenges: wrapper ordering, dependency issues, sandbox resets

4.2 Phase 2: Exploration Shaping (IN PROGRESS)

  • Status: started at 500,352 steps, running on an A10G at ~54 FPS
  • Mechanism: visit-count bonus = 1/(1+visits), adaptive annealing via tanh(avg_enemy_deaths) (sketched below)
  • ETA: ~2.5 hours, targeting 1,000,352 total steps
  • Purpose: force map exploration, prevent safe base-camping
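
A sketch of that shaping term; the 1/(1+visits) bonus and the tanh signal are from the description above, k=1.2 is taken from the interim results table, and the exact way the two factors combine is an assumption.

```python
import math
from collections import defaultdict

class ExplorationBonus:
    """Visit-count shaping with adaptive annealing (assumed combination:
    the bonus fades as the agent starts winning fights)."""

    def __init__(self, k: float = 1.2):
        self.k = k
        self.visits = defaultdict(int)

    def reset(self):
        self.visits.clear()  # per-episode visit counts

    def bonus(self, cell: tuple, avg_enemy_deaths: float) -> float:
        self.visits[cell] += 1
        novelty = 1.0 / (1.0 + self.visits[cell])   # visit-count bonus
        anneal = 1.0 - math.tanh(avg_enemy_deaths)  # adaptive annealing
        return self.k * novelty * anneal            # k=1.2 per interim table
```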

4.3 Phase 3: Curriculum Self-Play

Pending: rule-based opponent curriculum (static → simple → smart → mixed), 3 teams, 1M steps


5. Results

5.1 Phase 1 Results

| Metric | Value |
| --- | --- |
| Timesteps | 500,352 |
| Final Reward | 237.0 |
| FPS | 52 (A10G) |
| Wall time | ~2h 15min |
| Win Rate (eval) | 92.0% |
| Avg Reward (eval) | 180.1 |
| Survival Rate | 100.0% |

5.2 Phase 2 Interim (Early)

| Metric | Value |
| --- | --- |
| Starting Step | 500,352 |
| Initial Reward (shaped) | 210 |
| FPS | 54 |
| Explore Weight | Adaptive (k=1.2) |

6. Artifacts

| File | Purpose |
| --- | --- |
| phase1_final.zip | Trained model |
| phase2_final.zip | (in progress) |
| ckpt_50000-400000.zip | Phase 1 intermediates |
| ae_manager.py | Inference code |
| docs/ae.md | This documentation |

7. Next Steps

  • Phase 2: Complete 500k exploration-shaping steps
  • Phase 3: Curriculum vs rule-based opponents (1M steps)
  • Eval: Multi-team evaluation vs smart opponents
  • Future: CNN policy, opponent modeling, LSTM memory

Last updated: 2026-05-14 — Phase 2 in progress