E-Rong committed · verified · Commit 5c6cad0 · 1 Parent(s): 0b0cf6d

Add comprehensive documentation for TIL-26-AE project

Files changed (1): docs/ae.md (+358 −0, new file)

# TIL-26-AE: Automated Exploration Bomberman Agent

**Repository**: `E-Rong/til-26-ae-agent`
**Challenge**: The Intelligent League (TIL) — Automated Exploration (AE)
**Base Environment**: `e-rong/til-26-ae` Space
**Model Repo**: `E-Rong/til-26-ae-agent` (checkpoints + inference code)

---

## Table of Contents

1. [Research & Literature Review](#1-research--literature-review)
2. [Problem Analysis](#2-problem-analysis)
3. [Development Decisions](#3-development-decisions)
4. [Training Phases](#4-training-phases)
5. [Results](#5-results)
6. [Artifacts](#6-artifacts)
7. [Next Steps](#7-next-steps)

---

## 1. Research & Literature Review

### 1.1 Domain: Multi-Agent Bomberman RL

The TIL-26-AE challenge is a multi-agent Bomberman-like environment where agents navigate a grid, collect resources, place bombs, destroy walls, and eliminate opponents. The key challenge is **autonomous exploration** — agents must learn to navigate, compete, and survive without hand-crafted heuristics.

### 1.2 Key Papers Consulted

| Paper | arXiv ID | Key Insight | Relevance |
|---|---|---|---|
| *Pommerman: A Multi-Agent Benchmark* | 2407.00662 | Multi-agent competitive environment similar to Bomberman; MAPPO baseline performance | Confirmed PettingZoo + parallel env as the standard approach |
| *The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games* | 2103.01955 | MAPPO with shared parameters, role-specific conditioning | Justified single-agent wrapper with self-play curriculum |
| *A Closer Look at Invalid Action Masking in Policy Gradient Algorithms* | 2006.14171 | Invalid action masking dramatically improves sample efficiency in discrete action spaces with legal-action constraints | **Directly applicable** — Bomberman has wall/edge constraints |
| *Proximal Policy Optimization Algorithms* | 1707.06347 | PPO with clipped surrogate objective, stable and scalable | Chosen over DQN for continuous policy updates and easier masking integration |

### 1.3 Why MaskablePPO?

After reading `arxiv:2006.14171`, we identified that **invalid action masking** is critical for this domain:

- Bomberman agents cannot move into walls, out of bounds, or place bombs without stockpile
- The observation includes `action_mask: uint8[6]` — a binary legal-action indicator
- Standard PPO would waste ~30-40% of samples on illegal moves early in training
- `sb3-contrib`'s `MaskablePPO` masks logits before softmax, ensuring only legal actions are sampled

**Decision**: Use `sb3-contrib`'s `MaskablePPO` with `ActionMasker` wrapper.
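
The mechanics are simple: illegal logits are pushed to a very large negative value before the softmax, so illegal actions get (near-)zero probability and are never sampled. Below is a minimal, self-contained sketch of that idea for our 6-action space — not sb3-contrib's internal implementation, just the principle it applies:

```python
import torch

def masked_distribution(logits: torch.Tensor, action_mask: torch.Tensor) -> torch.distributions.Categorical:
    """Drop illegal actions by pushing their logits to the most negative finite value."""
    neg_inf = torch.finfo(logits.dtype).min
    masked_logits = torch.where(action_mask.bool(), logits, torch.full_like(logits, neg_inf))
    return torch.distributions.Categorical(logits=masked_logits)

# Example: only FORWARD (0), STAY (4) and PLACE_BOMB (5) are currently legal.
dist = masked_distribution(torch.randn(6), torch.tensor([1, 0, 0, 0, 1, 1], dtype=torch.uint8))
action = dist.sample()  # illegal indices have ~zero probability and are never drawn
```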

### 1.4 Why Curriculum Learning?

From `arxiv:2103.01955` (MAPPO) and Pommerman benchmarks, we learned:

- Training against strong opponents from scratch leads to **catastrophic early losses** (~0 reward)
- Curriculum learning (easy → hard) is standard practice in competitive multi-agent RL
- Rule-based opponents at increasing difficulty provide stable reward signals during learning

**Decision**: Implement a 3-phase curriculum with adaptive difficulty gating.

### 1.5 Why Not DQN / Rainbow?

- DQN struggles with action masking (requires custom architecture)
- PPO's on-policy updates handle the non-stationarity of multi-agent self-play better
- PPO is simpler to tune and has mature invalid-action-masking support in `sb3-contrib`

---

## 2. Problem Analysis

### 2.1 Environment Structure

The `til-26-ae` environment (`e-rong/til-26-ae` Space) is a PettingZoo-style AEC (Agent Environment Cycle) multi-agent game:

- **Grid size**: 16×16 (confirmed from `default_config()`)
- **Agents**: Configurable (default 2 teams, Phase 3 uses 3)
- **Observations**: Dict with:
  - `agent_viewcone`: float32 [7×5×25] — agent-facing view
  - `base_viewcone`: float32 [5×5×25] — base-centered view
  - `direction`: Discrete(4) — facing
  - `location`, `base_location`, `health`, `frozen_ticks`, `base_health`, `team_resources`, `team_bombs`, `step`
  - `action_mask`: uint8[6] — binary legality mask
- **Actions**: Discrete(6)
  - 0 = FORWARD, 1 = BACKWARD, 2 = LEFT, 3 = RIGHT, 4 = STAY, 5 = PLACE_BOMB
- **Episode length**: ~200 steps (observed during training)

### 2.2 Observation Flattening

We flatten the dict observation into a **1511-dim vector**:

```
agent_viewcone: 7 × 5 × 25 = 875
base_viewcone:  5 × 5 × 25 = 625
direction:                    1
location:                     2
base_location:                2
health:                       1
frozen_ticks:                 1
base_health:                  1
team_resources:               1
team_bombs:                   1
step:                         1
─────────────────────────────
TOTAL:                     1511
```

This matches the MLP policy input in `MaskablePPO("MlpPolicy", ...)`.
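
A sketch of the flattening, assuming the keys and shapes listed above (scalar fields may arrive as Python ints or 0-d arrays, hence the `np.ravel` coercion); the exact helper name in our code differs:

```python
import numpy as np

def flatten_observation(obs: dict) -> np.ndarray:
    """Flatten the dict observation into the 1511-dim vector fed to the MLP policy."""
    parts = [
        np.asarray(obs["agent_viewcone"], dtype=np.float32).ravel(),  # 875
        np.asarray(obs["base_viewcone"], dtype=np.float32).ravel(),   # 625
        np.ravel(obs["direction"]).astype(np.float32),                # 1
        np.ravel(obs["location"]).astype(np.float32),                 # 2
        np.ravel(obs["base_location"]).astype(np.float32),            # 2
        np.ravel(obs["health"]).astype(np.float32),                   # 1
        np.ravel(obs["frozen_ticks"]).astype(np.float32),             # 1
        np.ravel(obs["base_health"]).astype(np.float32),              # 1
        np.ravel(obs["team_resources"]).astype(np.float32),           # 1
        np.ravel(obs["team_bombs"]).astype(np.float32),               # 1
        np.ravel(obs["step"]).astype(np.float32),                     # 1
    ]
    flat = np.concatenate(parts)
    assert flat.shape == (1511,), flat.shape
    return flat
```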

### 2.3 Action Masking Implementation

```python
env = ActionMasker(base_env, lambda e: e.action_masks())
```

The wrapper exposes `action_masks()` which returns a bool[6] array. `MaskablePPO` uses this internally via `sb3_contrib`'s `get_action_masks()` during rollout collection.

**Critical bug found**: `Monitor` must wrap *outside* `ActionMasker`, not inside. Otherwise `get_action_masks()` fails because `Monitor` does not expose `action_masks()`. We fixed this ordering issue during development.
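
In wrapper terms the fix is a one-line ordering change; a sketch, where `base_env` is the single-agent wrapper from §3.1 (it defines `action_masks()`):

```python
from sb3_contrib.common.wrappers import ActionMasker
from stable_baselines3.common.monitor import Monitor

# Correct ordering: ActionMasker sees the raw env directly, Monitor goes on the OUTSIDE.
masked_env = ActionMasker(base_env, lambda e: e.action_masks())
env = Monitor(masked_env)

# Buggy ordering we had first: ActionMasker(Monitor(base_env), ...) — the mask_fn
# then receives the Monitor wrapper, which does not define action_masks().
```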

---

## 3. Development Decisions

### 3.1 Why Single-Agent Wrapper?

The TIL environment is inherently multi-agent (PettingZoo AEC). However, for the AE challenge, we only control **agent_0**; opponents use fixed policies. We wrapped the parallel PettingZoo env into a `gymnasium.Env` that:

- Runs the full multi-agent step
- Returns only agent_0's observation/reward/done
- Uses random valid actions for opponents (Phase 1-2) or rule-based policies (Phase 3)

This reduces the problem to single-agent RL with a non-stationary environment (opponent policies change between phases).
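
A condensed sketch of that wrapper. It assumes a recent PettingZoo parallel API (`parallel_env`) whose agent ids include `"agent_0"`, reuses `flatten_observation` from §2.2, and glosses over edge cases (agent death mid-episode, shrinking agent lists) the real wrapper has to handle:

```python
import gymnasium as gym
import numpy as np

class SingleAgentWrapper(gym.Env):
    """Drive the PettingZoo parallel env, expose only agent_0 to SB3 (sketch)."""

    def __init__(self, parallel_env, opponent_policy=None):
        self.penv = parallel_env
        self.opponent_policy = opponent_policy  # None => random valid actions
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, (1511,), np.float32)
        self.action_space = gym.spaces.Discrete(6)
        self._last_obs = None

    def reset(self, *, seed=None, options=None):
        obs, infos = self.penv.reset(seed=seed)
        self._last_obs = obs
        return flatten_observation(obs["agent_0"]), infos.get("agent_0", {})

    def action_masks(self):
        # Consumed by ActionMasker / MaskablePPO during rollout collection.
        return np.asarray(self._last_obs["agent_0"]["action_mask"], dtype=bool)

    def _opponent_action(self, agent):
        if self.opponent_policy is not None:
            return self.opponent_policy(self._last_obs[agent])
        mask = np.asarray(self._last_obs[agent]["action_mask"], dtype=bool)
        return int(np.random.choice(np.flatnonzero(mask)))  # random *valid* action

    def step(self, action):
        actions = {a: self._opponent_action(a) for a in self.penv.agents if a != "agent_0"}
        actions["agent_0"] = int(action)
        obs, rewards, terms, truncs, infos = self.penv.step(actions)
        self._last_obs = obs
        return (flatten_observation(obs["agent_0"]), rewards.get("agent_0", 0.0),
                terms.get("agent_0", False), truncs.get("agent_0", False),
                infos.get("agent_0", {}))
```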

### 3.2 Why 3-Phase Curriculum?

| Phase | Opponent | Duration | Purpose |
|---|---|---|---|
| **1** | Random valid actions | 500k steps | Learn basic movement, bomb mechanics, map navigation |
| **2** | Random + exploration shaping | 500k steps | Prevent "camping" exploit; encourage full map coverage |
| **3** | Rule-based (curriculum) | 1M steps | Generalize to structured opponents; scale to multi-team |

**Phase 1 vs Random** gives the agent a chance to learn fundamentals without being immediately killed by competent opponents. Random opponents still place bombs and move, providing exposure to explosion mechanics.

**Phase 2 Exploration Shaping** addresses a known issue: agents learn to survive by staying near their base and waiting for random opponents to walk into bombs. The visit-count bonus (`1/(1+visits)`) forces the agent to explore new tiles.

**Phase 3 Curriculum** transitions from random to structured opponents using a difficulty ladder: static → simple → smart → mixed. This mirrors how humans learn and prevents the "forgetting" problem when suddenly switching opponent types.

### 3.3 Why Stable-Baselines3 + sb3-contrib?

| Library | Role |
|---|---|
| `stable-baselines3` | Core PPO implementation, callbacks, Monitor, checkpoints |
| `sb3-contrib` | `MaskablePPO`, `ActionMasker`, invalid-action masking utilities |
| `gymnasium` | Env API (observation/action spaces, step/reset) |
| `pettingzoo` | Multi-agent env conversion (`aec_to_parallel`) |
| `huggingface_hub` | Push checkpoints to persistent storage |

### 3.4 Why Push Checkpoints to Hub Every 50k Steps?

During development, we encountered **sandbox resets** (T4 container recycled unexpectedly). Local `/app/data/` was lost, but the Hub model repo (`E-Rong/til-26-ae-agent`) persisted.

**Decision**: Implement a dual-save strategy:
- Local: `CheckpointCallback(save_freq=50000)`
- Hub: Custom callback calling `HfApi.upload_file()` every 50k steps

This saved the project when the sandbox reset at 400k steps — we resumed from `ckpt_400000.zip` on the Hub without losing progress.
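
A sketch of the Hub-side callback, assuming a valid `HF_TOKEN` (or cached login) is available; the class name and local path here are illustrative:

```python
import os
from huggingface_hub import HfApi
from stable_baselines3.common.callbacks import BaseCallback

class HubCheckpointCallback(BaseCallback):
    """Every `save_freq` steps, save the model locally and push the zip to the Hub."""

    def __init__(self, save_freq=50_000, repo_id="E-Rong/til-26-ae-agent",
                 local_dir="/app/data/checkpoints", verbose=0):
        super().__init__(verbose)
        self.save_freq = save_freq
        self.repo_id = repo_id
        self.local_dir = local_dir
        self.api = HfApi()  # token taken from HF_TOKEN / cached login

    def _on_step(self) -> bool:
        if self.num_timesteps % self.save_freq == 0:
            os.makedirs(self.local_dir, exist_ok=True)
            path = os.path.join(self.local_dir, f"ckpt_{self.num_timesteps}.zip")
            self.model.save(path)
            self.api.upload_file(
                path_or_fileobj=path,
                path_in_repo=os.path.basename(path),
                repo_id=self.repo_id,
                repo_type="model",
            )
        return True  # never abort training
```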

### 3.5 Why A10G over T4?

| Hardware | FPS | Time for 100k steps |
|---|---|---|
| T4 | ~42 | ~40 min |
| A10G | ~52 | ~32 min |

A10G provided more stable performance and more VRAM (24 GB vs. the T4's 16 GB). At roughly 8 minutes saved per 100k steps, the ~2M total steps across the 3 phases work out to about 2.5 hours of wall time saved.

---

## 4. Training Phases

### 4.1 Phase 1: Foundation (MaskablePPO vs Random)

**Duration**: 500,000 steps
**Opponent**: Random valid actions
**Environment**: 2 teams, agent_0 vs random
**Hyperparameters**:
```python
MaskablePPO(
    "MlpPolicy", env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.01,  # Encourage exploration early
    vf_coef=0.5,
    max_grad_norm=0.5,
    device="cuda",
)
```

**Purpose**: Learn basic survival, map layout, bomb mechanics, and opponent interaction. Random opponents provide a low-stakes environment where the agent can experiment without being immediately eliminated.

**Challenges encountered**:
- Initial wrapper ordering bug (`Monitor` inside `ActionMasker`)
- Missing dependencies (`omegaconf`, `perlin_noise`) in fresh sandboxes
- Sandbox resets — resolved by Hub checkpointing (resume sketch below)
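
Resuming after a reset is then just a download plus `MaskablePPO.load`; a sketch, assuming the `env` and callback objects built above (names like `hub_callback` are illustrative):

```python
from huggingface_hub import hf_hub_download
from sb3_contrib import MaskablePPO

# Pull the last checkpoint that made it to the Hub and continue training from it.
ckpt_path = hf_hub_download(repo_id="E-Rong/til-26-ae-agent", filename="ckpt_400000.zip")
model = MaskablePPO.load(ckpt_path, env=env, device="cuda")
model.learn(
    total_timesteps=100_000,            # additional steps to reach the 500k target
    reset_num_timesteps=False,          # keep the global step counter at 400k
    callback=[checkpoint_callback, hub_callback],  # local + Hub saving from §3.4
)
```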

### 4.2 Phase 2: Exploration Shaping (Adaptive Annealing)

**Duration**: 500,000 steps
**Opponent**: Random valid actions
**Environment**: 2 teams + exploration bonus
**Mechanism**:
```python
# Visit-count bonus
visit_bonus = 1.0 / (1.0 + visit_counts[x, y])

# Adaptive annealing
alpha = 1.0 - tanh(k * avg_enemy_deaths)
explore_weight = base_weight * max(0.1, alpha)
```

As the agent gets better at killing enemies, the exploration bonus fades, shifting focus toward combat optimization.

**Purpose**: Prevent the "camping" exploit where agents hide near their base and wait. Force proactive map exploration and resource collection.
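
A sketch of a reward-shaping wrapper implementing the mechanism above; the `info["agent_location"]` lookup and the running `avg_enemy_deaths` update are assumptions about how the stats are surfaced in our env:

```python
import math
import gymnasium as gym
import numpy as np

class ExplorationShapingWrapper(gym.Wrapper):
    """Add an annealed visit-count bonus on top of the environment reward (sketch)."""

    def __init__(self, env, grid_size=16, base_weight=0.1, k=0.5):
        super().__init__(env)
        self.visit_counts = np.zeros((grid_size, grid_size), dtype=np.int64)
        self.base_weight = base_weight
        self.k = k
        self.avg_enemy_deaths = 0.0  # running estimate, updated from episode infos

    def reset(self, **kwargs):
        self.visit_counts[:] = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        x, y = info["agent_location"]          # assumed to be exposed by the env
        self.visit_counts[x, y] += 1
        visit_bonus = 1.0 / (1.0 + self.visit_counts[x, y])

        # Anneal: the better we get at killing enemies, the smaller the bonus.
        alpha = 1.0 - math.tanh(self.k * self.avg_enemy_deaths)
        explore_weight = self.base_weight * max(0.1, alpha)

        return obs, reward + explore_weight * visit_bonus, terminated, truncated, info
```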

### 4.3 Phase 3: Curriculum Self-Play (Rule-Based Opponents)

**Duration**: 1,000,000 steps
**Opponent**: Rule-based with curriculum difficulty
**Environment**: 3 teams
**Curriculum stages**:
1. **Static**: Opponents do nothing (STAY)
2. **Simple**: Bomb when enemy in viewcone, otherwise random move (see the sketch below)
3. **Smart**: Score-based action selection (collectibles + wall avoidance)
4. **Mixed**: Half smart, half simple
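
The "simple" stage boils down to a few lines; a sketch, with `enemy_in_viewcone` standing in for whatever viewcone check the real opponent uses and action indices taken from §2.1:

```python
import numpy as np

STAY, PLACE_BOMB = 4, 5  # action indices from §2.1

def simple_opponent(obs: dict) -> int:
    """Bomb when an enemy is visible in the viewcone, otherwise move randomly (sketch)."""
    mask = np.asarray(obs["action_mask"], dtype=bool)
    if enemy_in_viewcone(obs["agent_viewcone"]) and mask[PLACE_BOMB]:  # assumed helper
        return PLACE_BOMB
    legal_moves = np.flatnonzero(mask[:4])  # FORWARD / BACKWARD / LEFT / RIGHT only
    return int(np.random.choice(legal_moves)) if len(legal_moves) else STAY
```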

**Advancement condition**: Win rate ≥ 55% over the last 500 episodes, or a cap of 500 episodes per stage — whichever comes first.
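
The gating itself is a small win-rate window check; a sketch (stage names mirror the list above, episode-loop plumbing omitted):

```python
from collections import deque

STAGES = ["static", "simple", "smart", "mixed"]

class CurriculumGate:
    """Advance to the next opponent stage once the recent win rate clears the threshold."""

    def __init__(self, win_threshold=0.55, window=500, max_episodes=500):
        self.win_threshold = win_threshold
        self.window = deque(maxlen=window)
        self.max_episodes = max_episodes
        self.stage_idx = 0
        self.episodes_in_stage = 0

    @property
    def stage(self) -> str:
        return STAGES[self.stage_idx]

    def record_episode(self, won: bool) -> None:
        self.window.append(1.0 if won else 0.0)
        self.episodes_in_stage += 1
        win_rate = sum(self.window) / len(self.window)
        hit_rate = len(self.window) == self.window.maxlen and win_rate >= self.win_threshold
        timed_out = self.episodes_in_stage >= self.max_episodes
        if (hit_rate or timed_out) and self.stage_idx < len(STAGES) - 1:
            self.stage_idx += 1
            self.window.clear()
            self.episodes_in_stage = 0
```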

**Purpose**: Generalize to structured, multi-team competition. The curriculum ensures the agent doesn't face a difficulty cliff when switching from random to competent opponents.

---

## 5. Results

### 5.1 Phase 1 Results

| Metric | Value |
|---|---|
| **Timesteps** | 500,352 |
| **Final Training Reward** | 237.0 |
| **FPS** | 52 (A10G) |
| **Total wall time** | ~2h 15min |
| **Checkpoints** | ckpt_50000, 100000, 150000, 200000, 250000, 300000, 350000, 400000 + phase1_final |

### 5.2 Phase 1 Evaluation (100 Episodes vs Random Opponents)

| Metric | Value |
|---|---|
| **Win Rate** | **92.0%** (92/100) |
| **Average Reward** | **180.1** |
| **Average Episode Length** | 200.0 steps |
| **Average Bombs/Episode** | 20.4 |
| **Survival Rate** | **100.0%** |

**Interpretation**: The agent has mastered the basics against random opponents. It consistently survives full episodes, places bombs frequently, and wins nearly every match. The gap between the training reward (237) and the eval reward (180) most likely reflects the stochastic rollout policy (encouraged by the entropy bonus) used during training, which does not carry over to deterministic evaluation.

### 5.3 Training Reward Trajectory (Phase 1)

| Steps | Episode Reward | Notes |
|---|---|---|
| 2,048 | 41.4 | Initial random policy |
| 20,480 | ~104 | Learning movement |
| 53,248 | ~116 | First checkpoint |
| 110,592 | ~159 | Consistent improvement |
| 204,096 | ~219 | Strong policy emerging |
| 306,496 | ~203 | Slight dip (exploration) |
| 416,384 | ~224 | Convergence |
| 500,352 | 237.0 | Final |

---

## 6. Artifacts

### 6.1 Model Repo (`E-Rong/til-26-ae-agent`)

| File | Purpose |
|---|---|
| `phase1_final.zip` | Phase 1 trained model (500k steps) |
| `ckpt_50000.zip` – `ckpt_400000.zip` | Intermediate checkpoints |
| `ae_manager.py` | Inference code for AE server |
| `phase1_eval_results.txt` | Raw evaluation numbers |
| `phase1_summary.txt` | This summary (abridged) |
| `train_all_phases.py` | Full training script |
| `train_in_space.py` | Space-compatible training script |
| `requirements.txt` | Python dependencies |

### 6.2 Space Integration (`e-rong/til-26-ae`)

| File | Purpose |
|---|---|
| `ae/src/ae_manager.py` | Loads `phase1_final.zip` from Hub, serves actions via `/ae` endpoint |
| `ae/requirements.txt` | `sb3-contrib`, `torch`, `huggingface_hub` |
| `ae/Dockerfile` | Standard Python 3.11 image, CPU-only for fast eval startup |

### 6.3 How Inference Works

1. AE server receives `POST /ae` with observation dict
2. `AEManager.ae(observation)` flattens obs → 1511-dim vector
3. Loads `MaskablePPO` from `phase1_final.zip` (cached at `/workspace/models/`)
4. Calls `model.predict(obs_vec, action_masks=mask, deterministic=True)`
5. Returns action int in [0, 5]

**Fallback**: If no model found, returns random valid action.
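
Putting the pieces together, the manager is roughly the following sketch (the real `ae_manager.py` differs in details such as error handling and file layout); it reuses `flatten_observation` from §2.2:

```python
import numpy as np
from huggingface_hub import hf_hub_download
from sb3_contrib import MaskablePPO

class AEManager:
    """Load the Phase 1 policy once, answer /ae requests with a single action (sketch)."""

    def __init__(self, repo_id="E-Rong/til-26-ae-agent", filename="phase1_final.zip"):
        try:
            path = hf_hub_download(repo_id=repo_id, filename=filename,
                                   local_dir="/workspace/models")
            self.model = MaskablePPO.load(path, device="cpu")
        except Exception:
            self.model = None  # fall back to random valid actions

    def ae(self, observation: dict) -> int:
        mask = np.asarray(observation["action_mask"], dtype=bool)
        if self.model is None:
            return int(np.random.choice(np.flatnonzero(mask)))  # fallback path
        obs_vec = flatten_observation(observation)  # 1511-dim vector, see §2.2
        action, _ = self.model.predict(obs_vec, action_masks=mask, deterministic=True)
        return int(action)
```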

---

## 7. Next Steps

### 7.1 Phase 2: Exploration Shaping (Pending)

- Load `phase1_final.zip`
- Add `RewardShapingWrapper` with adaptive visit-count bonus
- Train 500k steps vs random
- Expected: Higher map coverage, less base-camping, similar win rate

### 7.2 Phase 3: Curriculum Self-Play (Pending)

- Load Phase 2 final model
- Configure 3-team environment
- Progress through rule-based opponent difficulty
- Expected: Win rate drops initially, then recovers as curriculum advances

### 7.3 Evaluation Against Non-Random Opponents (Pending)

- Evaluate Phase 3 model vs rule-based "smart" opponents
- Target: > 50% win rate against smart opponents
- Multi-team evaluation (3-way matches)

### 7.4 Known Limitations

- **No recurrent policy**: The MLP policy has no memory of past observations. May struggle with bomb fuse timing or opponent tracking.
- **No opponent modeling**: The policy treats opponent actions as environment noise. Could benefit from opponent ID or history encoding.
- **Flattened observations**: Dict observations with spatial structure (viewcones) are flattened into vectors. A CNN policy might exploit spatial patterns better.
- **Deterministic eval**: Currently uses `deterministic=True` for evaluation. Stochastic evaluation might reveal policy variance.

### 7.5 Future Improvements

1. **CNN policy**: Use `CnnPolicy` with 2D viewcone inputs instead of MLP
2. **LSTM/GRU**: Add memory for temporal opponent tracking
3. **Proper self-play**: Train both teams simultaneously, with either shared or separate policy networks
4. **Population-based training**: Train a population of agents and evaluate against each other
5. **Reward decomposition**: Separate rewards for movement, bombs, kills, survival, resource collection

---

*Last updated: 2026-05-14*
*Current phase: Phase 1 complete, Phase 2 pending*
*Author: ML Intern (E-Rong)*