E-Rong committed (verified)
Commit 91a90a5 · 1 parent: 7be626a

Upload README.md

Files changed (1): README.md (+98 −16)

README.md CHANGED
@@ -1,26 +1,108 @@

*(Removed: the previous auto-generated [ML Intern](https://github.com/huggingface/ml-intern) model card, with its `ml-intern` tag and a generic `transformers` usage snippet.)*
# TIL-26-AE Bomberman Agent — MaskablePPO + Curriculum Learning

This repository contains the training pipeline for an RL agent competing in the **TIL-26 Automated Exploration** (AE) challenge — a competitive multi-agent Bomberman-like environment.

## 🎯 Challenge

[Environment](https://huggingface.co/spaces/e-rong/til-26-ae): 2–6 team competitive Bomberman on a procedurally generated 16×16 maze. Key challenges:
- **Partial observability** (directional viewcones, not the full map)
- **Sparse terminal rewards** (±50 for base destruction/survival)
- **Procedural generation** (a new maze every episode)
- **Risk of camping** near the base without an exploration signal
## 🏗️ Architecture

### Three-Phase Training Pipeline

| Phase | Description | Opponents | Key Technique |
|---|---|---|---|
| **1** | MaskablePPO baseline | Random valid actions | Invalid action masking |
| **2** | Adaptive exploration | Random + visit-count bonus | Annealing: `α = 1 − tanh(k·deaths)` |
| **3** | Curriculum self-play | Rule-based (static → smart) | Elo-style difficulty progression |
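The Phase 2 annealing coefficient from the table is simple enough to sketch. In this minimal version, the value of the constant `k` and the use of a cumulative death count are assumptions for illustration, not values taken from the training script:

```python
import math

def exploration_alpha(deaths: int, k: float = 0.01) -> float:
    """Exploration-bonus weight: alpha = 1 - tanh(k * deaths).

    `k` is an assumed sensitivity constant; `deaths` is the agent's
    cumulative death count. alpha starts at 1.0 and decays toward 0,
    so the visit-count bonus fades as training progresses.
    """
    return 1.0 - math.tanh(k * deaths)
```

With `k = 0.01`, the bonus is near full strength early on and drops to roughly half strength after about 55 deaths (tanh(0.55) ≈ 0.5).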
### Design Decisions (Literature-Backed)

1. **MaskablePPO** (`sb3-contrib`): handles invalid actions by setting their logits to `-∞` before the softmax, which is proven superior to action penalties (Huang & Ontañón, 2020).
2. **MAPPO-style hyperparameters**: value normalization, centralized value function with decentralized policies, and low sample reuse (Yu et al., NeurIPS 2022).
3. **Adaptive exploration annealing**: taken directly from the Pommerman state of the art (2024). As agent skill improves, the exploration bonus decreases automatically, preventing camping.
4. **Curriculum learning**: four stages of opponents (static → simple → smart → mixed). Advance at a 55% win rate, or after 500 episodes at most.
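The masking trick in design decision 1 can be shown in a few self-contained lines. MaskablePPO implements this internally; this standalone sketch only demonstrates why masked actions receive exactly zero probability:

```python
import math

def masked_softmax(logits, valid):
    """Invalid-action masking: push logits of invalid actions to -inf
    before the softmax, so those actions get exactly zero probability
    (and the policy gradient never reinforces them)."""
    masked = [l if ok else float("-inf") for l, ok in zip(logits, valid)]
    peak = max(masked)                      # subtract max for numerical stability
    exps = [math.exp(l - peak) for l in masked]
    total = sum(exps)
    return [e / total for e in exps]

# Action 1 is invalid here; its probability comes out exactly 0.0
probs = masked_softmax([2.0, 1.0, 0.5, -1.0], [True, False, True, True])
```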
### Key Papers

- **Pommerman multi-agent RL**: arXiv:2407.00662 — 98.85% win-rate recipe
- **MAPPO best practices**: arXiv:2103.01955 — NeurIPS 2022
- **Invalid action masking**: arXiv:2006.14171 — theoretically justified
- **RND exploration** (fallback): arXiv:1810.12894 — if Phase 2 still camps
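Returning to design decision 4, the stage-advancement rule reduces to a small predicate. This is a sketch under the thresholds stated above; the function name and signature are hypothetical, not the repo's actual API:

```python
def should_advance(wins: int, episodes: int,
                   threshold: float = 0.55, max_episodes: int = 500) -> bool:
    """Advance the curriculum stage once the win rate reaches 55%,
    or unconditionally after 500 episodes (thresholds from this README)."""
    if episodes >= max_episodes:
        return True  # cap: never stall on one stage forever
    return episodes > 0 and wins / episodes >= threshold
```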
## 🚀 Running Training

### Prerequisites
```bash
# Download the environment (auto-bootstrapped in the script)
python -c "from huggingface_hub import snapshot_download; snapshot_download('e-rong/til-26-ae', repo_type='space', local_dir='./til-26-ae-repo')"
```
### Local Training
```bash
export TOTAL_TIMESTEPS="500_000:500_000:1_000_000"
export HUB_MODEL_ID="E-Rong/til-26-ae-agent"
export TRACKIO_PROJECT="til-26-ae"
python train_all_phases.py
```
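The colon-separated `TOTAL_TIMESTEPS` value appears to carry one timestep budget per training phase. A sketch of parsing it under that assumption; `parse_phase_budgets` is an illustrative helper, not the script's actual API:

```python
import os

def parse_phase_budgets(spec: str) -> list[int]:
    """Split a colon-separated budget such as "500_000:500_000:1_000_000"
    into per-phase integer timestep counts. Python's int() accepts the
    underscore digit separators directly (PEP 515)."""
    return [int(part) for part in spec.split(":")]

budgets = parse_phase_budgets(os.environ.get("TOTAL_TIMESTEPS",
                                             "500_000:500_000:1_000_000"))
# → [500000, 500000, 1000000], one budget per phase
```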
### HF Jobs (Recommended)
```bash
# Requires HF credits — run from a Space with the script uploaded
# Hardware: cpu-upgrade, or a10g-large for GPU acceleration
```
## 📊 Monitoring

Trackio dashboard: `E-Rong/til-26-ae-trackio`

Logged metrics per phase:
- `train/mean_episode_reward`
- `train/mean_episode_length`
- `train/mean_explore_bonus` (Phase 2)
- `train/curriculum_stage` (Phase 3)

Alerts trigger on:
- Low reward (< −5) after 50k steps → suggests camping
- Curriculum stage advancement
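The low-reward alert reduces to a one-line predicate. A sketch using the thresholds stated above; the function is illustrative, not the pipeline's actual alerting code:

```python
def camping_suspected(mean_episode_reward: float, steps: int) -> bool:
    """Flag likely base-camping: mean episode reward still below -5
    after 50k training steps (thresholds from the alert rule above)."""
    return steps >= 50_000 and mean_episode_reward < -5.0
```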
## 📁 Repository Structure

```
train_all_phases.py          # Full 3-phase pipeline
requirements.txt             # Dependencies
bomberman_phase1_final.zip   # Saved after Phase 1
bomberman_phase2_final.zip   # Saved after Phase 2
bomberman_phase3_final.zip   # Saved after Phase 3
```
## 🧪 Evaluation

To evaluate a trained agent against random opponents:
```python
from train_all_phases import BombermanSingleAgentEnv
from sb3_contrib import MaskablePPO
from til_environment.config import default_config

cfg = default_config()
env = BombermanSingleAgentEnv(cfg=cfg)
model = MaskablePPO.load("bomberman_phase3_final")

obs, _ = env.reset(seed=42)
for _ in range(200):
    action, _ = model.predict(obs, action_masks=env.action_masks())
    obs, reward, done, truncated, info = env.step(action)
    if done or truncated:
        break
env.close()
```
## 📜 License

MIT — based on the TIL-26 AE challenge environment.