# TIL-26-AE Bomberman Agent — MaskablePPO + Curriculum Learning

This repository contains the training pipeline for an RL agent competing in the
**TIL-26 Automated Exploration** (AE) challenge — a competitive multi-agent
Bomberman-like environment.

## 🎯 Challenge

[Environment](https://huggingface.co/spaces/e-rong/til-26-ae): 2–6 team competitive Bomberman on a procedurally generated 16×16 maze. Key challenges:
- **Partial observability** (directional viewcones, not the full map)
- **Sparse terminal rewards** (±50 for destroying a base / surviving)
- **Procedural generation** (a new maze every episode)
- **Risk of camping** (near the base, absent an exploration signal)

## 🏗️ Architecture

### Three-Phase Training Pipeline

| Phase | Description | Opponents | Key Technique |
|---|---|---|---|
| **1** | MaskablePPO baseline | Random valid actions | Invalid action masking |
| **2** | Adaptive exploration | Random + visit-count bonus | Annealing: `α = 1 − tanh(k·deaths)` |
| **3** | Curriculum self-play | Rule-based (static → smart) | Elo-style difficulty progression |

### Design Decisions (Literature-Backed)

1. **MaskablePPO** (`sb3-contrib`): handles invalid actions by setting their logits to `-∞` before the softmax; proven superior to action penalties (Huang & Ontañón, 2020).
2. **MAPPO-style hyperparameters**: value normalization, centralized value function / decentralized policy, low sample reuse (Yu et al., NeurIPS 2022).
3. **Adaptive exploration annealing**: taken directly from the 2024 Pommerman SOTA. As agent skill improves, the exploration bonus decreases automatically, preventing camping (see the sketch after this list).
4. **Curriculum learning**: 4 stages — static → simple → smart → mixed opponents. Advance at a 55% win rate, or after 500 episodes at most (gate sketched below).
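
The two numeric recipes above are small enough to state in code. A minimal sketch: `K`, the `0.1` bonus scale, and the 100-episode window are illustrative choices, not the exact values in `train_all_phases.py`.

```python
import math
from collections import Counter, deque

# --- Phase 2: adaptive exploration bonus, alpha = 1 - tanh(k * deaths) ---
# K and the 0.1 bonus scale are illustrative, not the repo's exact values.
K = 0.01
visit_counts = Counter()   # (x, y) -> visits this episode
total_deaths = 0           # cumulative across training

def explore_bonus(cell):
    """Visit-count bonus, annealed away as deaths accumulate."""
    visit_counts[cell] += 1
    alpha = 1.0 - math.tanh(K * total_deaths)
    return alpha * 0.1 / math.sqrt(visit_counts[cell])

# --- Phase 3: curriculum gate, advance at 55% win rate or 500 episodes ---
recent_results = deque(maxlen=100)  # 1 = win, 0 = loss; window size illustrative
episodes_in_stage = 0

def should_advance():
    win_rate = sum(recent_results) / len(recent_results) if recent_results else 0.0
    return win_rate >= 0.55 or episodes_in_stage >= 500
```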

### Key Papers

- **Pommerman multi-agent RL**: arXiv:2407.00662 — 98.85% win-rate recipe
- **MAPPO best practices**: arXiv:2103.01955 — NeurIPS 2022
- **Invalid action masking**: arXiv:2006.14171 — theoretically justified
- **RND exploration** (fallback): arXiv:1810.12894 — in case Phase 2 still camps

## 🚀 Running Training

### Prerequisites
```bash
# Download the environment (auto-bootstrapped in the training script)
python -c "from huggingface_hub import snapshot_download; snapshot_download('e-rong/til-26-ae', repo_type='space', local_dir='./til-26-ae-repo')"
```

### Local Training
```bash
# Colon-separated step budgets for Phases 1, 2, 3
export TOTAL_TIMESTEPS="500_000:500_000:1_000_000"
export HUB_MODEL_ID="E-Rong/til-26-ae-agent"
export TRACKIO_PROJECT="til-26-ae"
python train_all_phases.py
```
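
For orientation, one phase of this pipeline reduces to a standard MaskablePPO training loop. A minimal sketch, not the exact contents of `train_all_phases.py`: it assumes the `BombermanSingleAgentEnv` and `action_masks()` names from the evaluation example below, and uses sb3-contrib's `ActionMasker` wrapper to expose masks during rollouts.

```python
import os

from sb3_contrib import MaskablePPO
from sb3_contrib.common.wrappers import ActionMasker
from til_environment.config import default_config
from train_all_phases import BombermanSingleAgentEnv

# int() accepts underscores, so "500_000" parses directly (PEP 515)
budgets = [int(s) for s in os.environ["TOTAL_TIMESTEPS"].split(":")]

env = BombermanSingleAgentEnv(cfg=default_config())
# Expose the env's mask to MaskablePPO during rollout collection
env = ActionMasker(env, lambda e: e.action_masks())

model = MaskablePPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=budgets[0])   # Phase 1 budget
model.save("bomberman_phase1_final")      # saved as bomberman_phase1_final.zip
```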

### HF Jobs (Recommended)
```bash
# Requires HF credits — run from a Space with the script uploaded.
# Hardware: cpu-upgrade, or a10g-large for GPU acceleration.
# Illustrative invocation (assumed flags; check `hf jobs uv run --help`):
hf jobs uv run --flavor a10g-large train_all_phases.py
```

## 📊 Monitoring

Trackio dashboard: `E-Rong/til-26-ae-trackio`

Logged metrics per phase (logging sketched below):
- `train/mean_episode_reward`
- `train/mean_episode_length`
- `train/mean_explore_bonus` (Phase 2)
- `train/curriculum_stage` (Phase 3)

Alerts trigger on:
- Low reward (< -5) after 50k steps → suggests camping
- Curriculum stage advancement
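
A minimal sketch of the logging side, assuming Trackio's wandb-style `init`/`log`/`finish` API; the metric values are placeholders, and the camping alert is written out as a plain check.

```python
import os

import trackio

# TRACKIO_PROJECT is set in the Local Training snippet above
trackio.init(project=os.environ.get("TRACKIO_PROJECT", "til-26-ae"))

mean_reward = -7.2    # placeholder; computed from recent episodes in practice
global_step = 60_000  # placeholder

trackio.log({
    "train/mean_episode_reward": mean_reward,
    "train/mean_episode_length": 140,   # placeholder
    "train/mean_explore_bonus": 0.05,   # Phase 2 only
    "train/curriculum_stage": 2,        # Phase 3 only
})

# The camping alert above, as a plain check:
if global_step > 50_000 and mean_reward < -5:
    print("ALERT: low reward after 50k steps; agent may be camping")

trackio.finish()
```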

## 📁 Repository Structure

```
train_all_phases.py          # Full 3-phase pipeline
requirements.txt             # Dependencies
bomberman_phase1_final.zip   # Saved after Phase 1
bomberman_phase2_final.zip   # Saved after Phase 2
bomberman_phase3_final.zip   # Saved after Phase 3
```

## 🧪 Evaluation

To evaluate a trained agent against random opponents:
```python
from train_all_phases import BombermanSingleAgentEnv
from sb3_contrib import MaskablePPO
from til_environment.config import default_config

cfg = default_config()
env = BombermanSingleAgentEnv(cfg=cfg)
model = MaskablePPO.load("bomberman_phase3_final")

obs, _ = env.reset(seed=42)
for _ in range(200):
    # Mask out invalid actions at inference time as well
    action, _ = model.predict(obs, action_masks=env.action_masks())
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break
env.close()
```
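
To turn this into a rough win rate, loop over whole episodes. A minimal sketch, assuming the ±50 terminal-reward convention described above, so a positive final reward stands in for a win:

```python
from train_all_phases import BombermanSingleAgentEnv
from sb3_contrib import MaskablePPO
from til_environment.config import default_config

env = BombermanSingleAgentEnv(cfg=default_config())
model = MaskablePPO.load("bomberman_phase3_final")

wins, n_episodes = 0, 20
for ep in range(n_episodes):
    obs, _ = env.reset(seed=ep)
    terminated = truncated = False
    reward = 0.0
    while not (terminated or truncated):
        action, _ = model.predict(obs, action_masks=env.action_masks())
        obs, reward, terminated, truncated, _ = env.step(action)
    wins += reward > 0   # terminal reward sign as the win signal
env.close()
print(f"Win-rate proxy: {wins / n_episodes:.0%}")
```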

## 📜 License

MIT — based on the TIL-26 AE challenge environment.