---
tags:
- ml-intern
---

# TIL-26-AE Bomberman Agent: MaskablePPO + Curriculum Learning

This repository contains the training pipeline for an RL agent competing in the
**TIL-26 Automated Exploration** (AE) challenge, a competitive multi-agent
Bomberman-like environment.

## 🎯 Challenge

[Environment](https://huggingface.co/spaces/e-rong/til-26-ae): 2–6 team competitive Bomberman on a procedurally generated 16×16 maze. Key challenges:
- **Partial observability** (directional viewcones, not the full map)
- **Sparse terminal rewards** (±50 for base destruction/survival)
- **Procedural generation** (new maze every episode)
- **Risk of camping** near the base without an exploration signal

## 🏗️ Architecture

### Three-Phase Training Pipeline

| Phase | Description | Opponents | Key Technique |
|---|---|---|---|
| **1** | MaskablePPO baseline | Random valid actions | Invalid action masking |
| **2** | Adaptive exploration | Random + visit-count bonus | Annealing: `α = 1 - tanh(k·deaths)` (sketched below) |
| **3** | Curriculum self-play | Rule-based (static → smart) | Elo-style difficulty progression |
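
A minimal sketch of the Phase 2 annealing, assuming `deaths` is the agent's cumulative death count across training and a hypothetical `1/√N` visit-count bonus (the actual constants live in `train_all_phases.py`):

```python
import numpy as np

def exploration_alpha(cumulative_deaths: int, k: float = 0.05) -> float:
    """Annealing coefficient from the table: alpha = 1 - tanh(k * deaths).

    As deaths accumulate over training, alpha decays from 1 toward 0, so
    the exploration bonus fades out. k = 0.05 is an illustrative value.
    """
    return 1.0 - np.tanh(k * cumulative_deaths)

def shaped_reward(env_reward: float, visit_count: int, cumulative_deaths: int) -> float:
    """Sparse environment reward plus an annealed count-based bonus."""
    bonus = 1.0 / np.sqrt(visit_count + 1)  # hypothetical 1/sqrt(N) bonus
    return env_reward + exploration_alpha(cumulative_deaths) * bonus
```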

### Design Decisions (Literature-Backed)

1. **MaskablePPO** (`sb3-contrib`): handles invalid actions by setting their logits to `-∞` before the softmax (illustrated after this list). Shown to outperform action-penalty approaches (Huang & Ontañón, 2020).
2. **MAPPO-style hyperparameters**: value normalization, centralized value function with decentralized policies, and low sample reuse (Yu et al., NeurIPS 2022).
3. **Adaptive exploration annealing**: taken directly from the 2024 Pommerman state of the art (cited below). As agent skill improves, the exploration bonus decays automatically, preventing camping.
4. **Curriculum learning**: four stages, static → simple → smart → mixed opponents. Advance at a 55% win rate, or after 500 episodes at most.
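
Decision 1 in plain NumPy, as a sketch of the masking math rather than `sb3-contrib`'s actual implementation:

```python
import numpy as np

def masked_policy(logits: np.ndarray, valid: np.ndarray) -> np.ndarray:
    """Set logits of invalid actions to -inf before the softmax.

    `valid` is a boolean mask of legal actions. Invalid actions receive
    exactly zero probability, so they are never sampled and contribute
    no gradient.
    """
    masked = np.where(valid, logits, -np.inf)
    z = np.exp(masked - masked.max())  # subtract max for numerical stability
    return z / z.sum()

# Example: action 2 (say, "place bomb") is illegal in the current state
print(masked_policy(np.array([1.0, 0.5, 2.0]), np.array([True, True, False])))
# -> [0.622 0.378 0.   ]
```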

### Key Papers

- **Pommerman multi-agent RL**: arXiv:2407.00662, the 98.85% win-rate recipe
- **MAPPO best practices**: arXiv:2103.01955 (NeurIPS 2022)
- **Invalid action masking**: arXiv:2006.14171, the theoretical justification
- **RND exploration** (fallback): arXiv:1810.12894, in case Phase 2 still camps

## 🚀 Running Training

### Prerequisites
```bash
# Download the environment (auto-bootstrapped in the script)
python -c "from huggingface_hub import snapshot_download; snapshot_download('e-rong/til-26-ae', repo_type='space', local_dir='./til-26-ae-repo')"
```

### Local Training
```bash
export TOTAL_TIMESTEPS="500_000:500_000:1_000_000"
export HUB_MODEL_ID="E-Rong/til-26-ae-agent"
export TRACKIO_PROJECT="til-26-ae"
python train_all_phases.py
```
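
`TOTAL_TIMESTEPS` appears to carry one colon-separated timestep budget per phase; a sketch of parsing it under that assumption (the actual handling in `train_all_phases.py` may differ):

```python
import os

# Three colon-separated budgets, one per training phase (assumed convention).
# int() accepts underscore digit separators directly (Python 3.6+).
raw = os.environ.get("TOTAL_TIMESTEPS", "500_000:500_000:1_000_000")
phase1_steps, phase2_steps, phase3_steps = (int(p) for p in raw.split(":"))
```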

### HF Jobs (Recommended)
```bash
# Requires HF credits; run from a Space with the script uploaded
# Hardware: cpu-upgrade, or a10g-large for GPU acceleration
```

## 📊 Monitoring

Trackio dashboard: `E-Rong/til-26-ae-trackio`

Logged metrics per phase:
- `train/mean_episode_reward`
- `train/mean_episode_length`
- `train/mean_explore_bonus` (Phase 2)
- `train/curriculum_stage` (Phase 3)
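
A sketch of logging these metrics with Trackio's wandb-style API (the metric values here are illustrative):

```python
import trackio

trackio.init(project="til-26-ae")  # project name from the dashboard above
trackio.log({
    "train/mean_episode_reward": -3.2,  # illustrative values only
    "train/mean_episode_length": 141.0,
    "train/mean_explore_bonus": 0.42,   # Phase 2
})
trackio.finish()
```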

Alerts trigger on:
- Low reward (< -5) after 50k steps, which suggests camping
- Curriculum stage advancement
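
The camping alert reduces to a simple predicate; a sketch (the function name is hypothetical, the thresholds are the ones stated above):

```python
def camping_suspected(mean_episode_reward: float, total_steps: int) -> bool:
    """Flag likely camping: persistently low reward after the 50k-step warm-up."""
    return total_steps > 50_000 and mean_episode_reward < -5
```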

## 📁 Repository Structure

```
train_all_phases.py          # Full 3-phase pipeline
requirements.txt             # Dependencies
bomberman_phase1_final.zip   # Saved after Phase 1
bomberman_phase2_final.zip   # Saved after Phase 2
bomberman_phase3_final.zip   # Saved after Phase 3
```

## 🧪 Evaluation

To evaluate a trained agent against random opponents:
```python
from train_all_phases import BombermanSingleAgentEnv
from sb3_contrib import MaskablePPO
from til_environment.config import default_config

cfg = default_config()
env = BombermanSingleAgentEnv(cfg=cfg)
model = MaskablePPO.load("bomberman_phase3_final")

obs, _ = env.reset(seed=42)
for _ in range(200):
    # Re-query the action mask each step so illegal moves are never sampled
    action, _ = model.predict(obs, action_masks=env.action_masks(), deterministic=True)
    obs, reward, done, truncated, info = env.step(action)
    if done or truncated:
        break
env.close()
```
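
To turn the single rollout into a rough win-rate estimate, loop over seeds; treating a return of +50 or more as a win is an assumption based on the ±50 terminal rewards described in the Challenge section:

```python
env = BombermanSingleAgentEnv(cfg=cfg)  # fresh env; the one above was closed
wins, n_episodes = 0, 20
for seed in range(n_episodes):
    obs, _ = env.reset(seed=seed)
    episode_return, done, truncated = 0.0, False, False
    while not (done or truncated):
        action, _ = model.predict(obs, action_masks=env.action_masks(), deterministic=True)
        obs, reward, done, truncated, info = env.step(action)
        episode_return += reward
    wins += episode_return >= 50  # assumption: +50 terminal reward marks a win
print(f"Win rate over {n_episodes} episodes: {wins / n_episodes:.0%}")
env.close()
```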

## 📄 License

MIT; based on the TIL-26 AE challenge environment.

<!-- ml-intern-provenance -->
## Generated by ML Intern

This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern

## Usage

The checkpoints in this repository are Stable-Baselines3 zip archives (see Repository Structure), not `transformers` weights, so load them with `sb3-contrib` rather than the `AutoModel` classes:

```python
from huggingface_hub import hf_hub_download
from sb3_contrib import MaskablePPO

model_id = "E-Rong/til-26-ae-agent"
checkpoint = hf_hub_download(model_id, "bomberman_phase3_final.zip")
model = MaskablePPO.load(checkpoint)
```