# TIL-26-AE: Automated Exploration Bomberman Agent
**Repository**: `E-Rong/til-26-ae-agent`
**Challenge**: The Intelligent League (TIL) — Automated Exploration (AE)
**Base Environment**: `e-rong/til-26-ae` Space
**Model Repo**: `E-Rong/til-26-ae-agent` (checkpoints + inference code)
---
## Table of Contents
1. [Research & Literature Review](#1-research--literature-review)
2. [Problem Analysis](#2-problem-analysis)
3. [Development Decisions](#3-development-decisions)
4. [Training Phases](#4-training-phases)
5. [Results](#5-results)
6. [Artifacts](#6-artifacts)
7. [Next Steps](#7-next-steps)
---
## 1. Research & Literature Review
### 1.1 Domain: Multi-Agent Bomberman RL
The TIL-26-AE challenge is a multi-agent Bomberman-like environment where agents navigate a grid, collect resources, place bombs, destroy walls, and eliminate opponents. The key challenge is **autonomous exploration**.
### 1.2 Key Papers
| Paper | arXiv ID | Key Insight | Relevance |
|---|---|---|---|
| *Pommerman: A Multi-Agent Benchmark* | 2407.00662 | PettingZoo + parallel env standard | Confirmed approach |
| *MAPPO* | 2103.01955 | Shared parameters, curriculum | Justified curriculum |
| *Invalid Action Masking* | 2006.14171 | Masks logits before softmax | **Directly applicable** |
| *PPO Algorithms* | 1707.06347 | Clipped surrogate, stable | Chosen over DQN |
### 1.3 Why MaskablePPO?
Bomberman agents cannot move into walls, step out of bounds, or place bombs with an empty stockpile. The observation therefore includes `action_mask: uint8[6]`. Standard PPO would waste ~30-40% of its samples proposing illegal moves; MaskablePPO masks the logits before the softmax, so only legal actions are ever sampled.
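The masking step itself is simple. Here is a minimal NumPy sketch of the logit masking MaskablePPO performs (the real implementation lives in `sb3_contrib`'s maskable distribution; this is an illustrative reconstruction, not the library code):

```python
import numpy as np

def masked_softmax(logits: np.ndarray, action_mask: np.ndarray) -> np.ndarray:
    """Replace illegal-action logits with -inf so their probability is exactly 0."""
    masked = np.where(action_mask.astype(bool), logits, -np.inf)
    # Subtract the max finite logit for numerical stability before exponentiating.
    exp = np.exp(masked - masked[np.isfinite(masked)].max())
    return exp / exp.sum()

# 6 actions; PLACE_BOMB (index 5) is illegal when the stockpile is empty.
logits = np.array([1.2, 0.3, -0.5, 0.8, 0.0, 2.0])
mask = np.array([1, 1, 1, 1, 1, 0], dtype=np.uint8)
probs = masked_softmax(logits, mask)  # probs[5] == 0.0, rest renormalized
```

Because the masked logits are set to `-inf` before normalization, the policy gradient never flows through illegal actions, which is why no samples are wasted on them.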
### 1.4 Why Curriculum Learning?
Training against strong opponents from scratch leads to catastrophic early losses (~0 reward). Curriculum learning (easy → hard) is standard in competitive multi-agent RL.
### 1.5 Why Not DQN?
DQN handles action masking poorly (it requires a custom architecture), while PPO's on-policy updates cope better with the non-stationarity of multi-agent self-play and have mature masking support in `sb3-contrib`.
---
## 2. Problem Analysis
### 2.1 Environment Structure
- **Grid size**: 16×16
- **Agents**: Configurable (default 2 teams, Phase 3 uses 3)
- **Observations**: Dict with `agent_viewcone[7×5×25]`, `base_viewcone[5×5×25]`, direction, location, health, `action_mask[6]`, etc.
- **Actions**: Discrete(6) — FORWARD, BACKWARD, LEFT, RIGHT, STAY, PLACE_BOMB
- **Episode length**: ~200 steps
### 2.2 Observation Flattening
Flattened to **1511-dim vector**: agent_viewcone(875) + base_viewcone(625) + 11 scalars.
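A flattening helper might look like the sketch below. The viewcone shapes match the dimensions above; grouping the 11 scalar fields (direction, location, health, etc.) under a single `"scalars"` key is an illustrative simplification, since the real observation Dict keeps them as separate fields:

```python
import numpy as np

def flatten_obs(obs: dict) -> np.ndarray:
    """Concatenate viewcones and scalar fields into the 1511-dim training vector."""
    parts = [
        np.asarray(obs["agent_viewcone"], dtype=np.float32).ravel(),  # 7*5*25 = 875
        np.asarray(obs["base_viewcone"], dtype=np.float32).ravel(),   # 5*5*25 = 625
        np.asarray(obs["scalars"], dtype=np.float32).ravel(),         # 11 scalars
    ]
    return np.concatenate(parts)

# Shape check with dummy data
dummy = {
    "agent_viewcone": np.zeros((7, 5, 25)),
    "base_viewcone": np.zeros((5, 5, 25)),
    "scalars": np.zeros(11),
}
vec = flatten_obs(dummy)  # shape (1511,)
```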
### 2.3 Action Masking
A critical bug surfaced here: `Monitor` must wrap *outside* `ActionMasker`, not inside. Otherwise `get_action_masks()` fails because `Monitor` does not expose an `action_masks()` method.
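A toy reproduction of the correct ordering, using stand-in classes (the real `Monitor` and `ActionMasker` live in `stable_baselines3` and `sb3_contrib`; these stand-ins only mimic the attribute-forwarding behavior relevant to the bug):

```python
class ToyActionMasker:
    """Stand-in for sb3_contrib's ActionMasker: exposes action_masks()
    by calling a user-supplied mask_fn on the wrapped env."""
    def __init__(self, env, mask_fn):
        self.env = env
        self.mask_fn = mask_fn

    def action_masks(self):
        return self.mask_fn(self.env)

class ToyMonitor:
    """Stand-in for SB3's Monitor: defines no action_masks() of its own,
    but forwards unknown attribute lookups to the wrapped env."""
    def __init__(self, env):
        self.env = env

    def __getattr__(self, name):
        return getattr(self.env, name)

class ToyEnv:
    def current_mask(self):
        return [1, 1, 1, 1, 1, 0]  # PLACE_BOMB currently illegal

# Correct order: Monitor outside, ActionMasker inside. The action_masks()
# lookup falls through Monitor's __getattr__ and reaches ActionMasker.
env = ToyMonitor(ToyActionMasker(ToyEnv(), lambda e: e.current_mask()))
masks = env.action_masks()  # [1, 1, 1, 1, 1, 0]
```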
---
## 3. Development Decisions
### 3.1 Single-Agent Wrapper
The wrapper controls only `agent_0`; opponents follow random policies (Phases 1-2) or rule-based policies (Phase 3). This reduces the task to single-agent RL in a non-stationary environment.
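In sketch form, assuming the environment follows the PettingZoo parallel API (`step` takes a dict of actions and returns per-agent dicts), the wrapper could look like this. The class and method names here are illustrative, not the repo's actual code; the dummy env exists only so the sketch is runnable:

```python
import random

class SingleAgentWrapper:
    """Drive `agent_0` with the learner; fill in all other agents
    with a scripted opponent policy (random here, as in Phases 1-2)."""
    def __init__(self, parallel_env, opponent_policy=None):
        self.env = parallel_env
        self.opponent_policy = opponent_policy or (lambda obs: random.randrange(6))

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        return obs["agent_0"], info.get("agent_0", {})

    def step(self, action):
        actions = {"agent_0": action}
        for agent in self.env.agents:
            if agent != "agent_0":
                actions[agent] = self.opponent_policy(None)
        obs, rew, term, trunc, info = self.env.step(actions)
        return (obs["agent_0"], rew["agent_0"],
                term["agent_0"], trunc["agent_0"], info.get("agent_0", {}))

class _DummyParallelEnv:
    """Minimal stand-in for the real TIL-26-AE env (illustrative only)."""
    agents = ["agent_0", "agent_1"]

    def reset(self, **kwargs):
        return {a: 0 for a in self.agents}, {a: {} for a in self.agents}

    def step(self, actions):
        zeros = {a: 0 for a in self.agents}
        done = {a: False for a in self.agents}
        return zeros, {a: 1.0 for a in self.agents}, done, done, {a: {} for a in self.agents}

env = SingleAgentWrapper(_DummyParallelEnv())
obs, info = env.reset()
obs, reward, terminated, truncated, info = env.step(3)
```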
### 3.2 3-Phase Curriculum
| Phase | Opponent | Duration | Purpose |
|---|---|---|---|
| **1** | Random | 500k | Learn movement, bombs, basics |
| **2** | Random + exploration bonus | 500k | Prevent camping exploit |
| **3** | Rule-based curriculum | 1M | Generalize to structured opponents |
### 3.3 Philosophy
- `stable-baselines3` for PPO core
- `sb3-contrib` for MaskablePPO + ActionMasker
- `huggingface_hub` for persistent checkpoint storage
### 3.4 Why Hub Every 50k Steps
Sandbox resets (T4 container recycling) caused local `/app/data/` loss multiple times. Hub checkpointing saved the project at 400k steps when training crashed.
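The checkpoint cadence can be sketched as a training callback. The real version would subclass `stable_baselines3`'s `BaseCallback` and push with `huggingface_hub.upload_file`; the wiring below injects the upload so the logic stands alone, and the filename pattern is an assumption:

```python
class HubCheckpointCallback:
    """Sketch of every-50k-steps Hub uploads: persistence logic only,
    with the actual upload injected so it can be stubbed out."""
    def __init__(self, save_every=50_000, upload_fn=None):
        self.save_every = save_every
        # Real version: upload_fn would call huggingface_hub.upload_file(...)
        self.upload_fn = upload_fn or (lambda path: None)
        self.uploaded = []

    def on_step(self, num_timesteps: int) -> None:
        if num_timesteps % self.save_every == 0:
            path = f"ckpt_{num_timesteps}.zip"
            self.upload_fn(path)  # survives sandbox resets once on the Hub
            self.uploaded.append(path)

cb = HubCheckpointCallback()
for t in range(1, 200_001):
    cb.on_step(t)
# cb.uploaded == ["ckpt_50000.zip", "ckpt_100000.zip",
#                 "ckpt_150000.zip", "ckpt_200000.zip"]
```

Because checkpoints live on the Hub rather than in `/app/data/`, a container recycle costs at most 50k steps of progress.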
---
## 4. Training Phases
### 4.1 Phase 1: Foundation (vs Random)
**Duration**: 500,352 steps
**Result**: Win rate 92%, avg reward 180.1, 100% survival
**Challenges**: Wrapper ordering, dependency issues, sandbox resets
### 4.2 Phase 2: Exploration Shaping (COMPLETE)
**Duration**: 500,408 additional steps (600,352 → 1,001,760)
**Mechanism**: Visit-count bonus = 1/(1+visits), adaptive annealing via tanh(avg_enemy_deaths)
**Hardware**: A10G, ~50 FPS
**Wall time**: ~2h 45min
**Result**: Win rate 93.0%, avg reward 153.4, avg bombs 20.1
**Key insight**: Reward decreased (180 → 153) but win rate increased (92% → 93%), confirming exploration makes the policy more robust at the cost of safe base-camping reward.
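The shaping mechanism above can be sketched directly from the stated formulas. The per-cell bonus is 1/(1+visits) and the anneal factor comes from tanh(avg_enemy_deaths); combining them multiplicatively via `1 - tanh(...)` is an assumption about how the two terms interact:

```python
import math

def exploration_bonus(visits: int, avg_enemy_deaths: float) -> float:
    """Phase 2 shaping sketch: visit-count novelty annealed as the agent
    learns to fight (the multiplicative combination is an assumption)."""
    novelty = 1.0 / (1.0 + visits)          # fresh cells pay more
    anneal = 1.0 - math.tanh(avg_enemy_deaths)  # fades toward 0 as kills rise
    return novelty * anneal

# Unvisited cell, passive agent: full bonus
b_fresh = exploration_bonus(0, 0.0)   # 1.0
# Heavily visited cell, aggressive agent: bonus ~0
b_stale = exploration_bonus(10, 2.0)
```

This shape matches the key insight above: as the policy starts winning fights, the bonus shrinks and the headline reward drops even while the win rate climbs.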
### 4.3 Phase 3: Curriculum Self-Play (PENDING)
**Script**: `phase3_curriculum.py` (ready on Hub)
**Plan**: 5-stage rule-based curriculum — static → random → simple_bomb → evasive → mixed
**Duration**: 1M steps
**Advancement gate**: >55% win rate per stage
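The advancement gate reduces to a small state machine over the five stages. A minimal sketch, assuming the evaluation loop supplies a per-stage win rate (function and constant names are illustrative, not taken from `phase3_curriculum.py`):

```python
STAGES = ["static", "random", "simple_bomb", "evasive", "mixed"]
WIN_RATE_GATE = 0.55  # >55% win rate required to advance

def next_stage(stage_idx: int, win_rate: float) -> int:
    """Advance to the next curriculum stage only when the eval win rate
    clears the gate; otherwise stay and keep training on this stage."""
    if win_rate > WIN_RATE_GATE and stage_idx < len(STAGES) - 1:
        return stage_idx + 1
    return stage_idx

stage = next_stage(0, 0.62)  # static cleared at 62% -> move to random
```

The final stage is terminal: once on `mixed`, the agent stays there for the remainder of the 1M-step budget regardless of win rate.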
---
## 5. Results
### 5.1 Phase 1 Results
| Metric | Value |
|---|---|
| Timesteps | 500,352 |
| Final Reward | 237.0 |
| FPS | 52 (A10G) |
| Wall time | ~2h 15min |
| Win Rate (eval) | **92.0%** |
| Avg Reward (eval) | **180.1** |
| Survival Rate | **100.0%** |
### 5.2 Phase 2 Results
| Metric | Value |
|---|---|
| Timesteps | 1,001,760 total (500,408 new) |
| FPS | 50 (A10G) |
| Wall time | ~2h 45min |
| Win Rate (eval) | **93.0%** |
| Avg Reward (eval) | **153.4** |
| Avg Bombs | **20.1** |
---
## 6. Artifacts
| File | Purpose |
|---|---|
| `phase1_final.zip` | Phase 1 complete checkpoint |
| `phase2_final.zip` | Phase 2 complete checkpoint |
| `phase2_ckpt_*.zip` | Phase 2 intermediates (650k–1M) |
| `phase2_eval_results.txt` | Phase 2 evaluation metrics |
| `ae_manager.py` | Inference code |
| `docs/ae.md` | This documentation |
---
## 7. Next Steps
- [ ] Submit Phase 3 HF Job (`phase3_curriculum.py`)
- [ ] Monitor 5-stage curriculum progression
- [ ] Evaluate final model vs mixed rule-based opponents
- [ ] Future: CNN policy, opponent modeling, LSTM memory
*Last updated: 2026-05-14 — Phase 2 complete, Phase 3 ready*