---
tags:
- ml-intern
---
# TIL-26-AE Bomberman Agent: MaskablePPO + Curriculum Learning
This repository contains the training pipeline for an RL agent competing in the
**TIL-26 Automated Exploration** (AE) challenge, a competitive multi-agent
Bomberman-like environment.
## 🎯 Challenge
[Environment](https://huggingface.co/spaces/e-rong/til-26-ae): 2–6 team competitive Bomberman on a procedurally generated 16×16 maze. Key challenges:
- **Partial observability** (directional viewcones, not full map)
- **Sparse terminal rewards** (±50 for base destruction/survival)
- **Procedural generation** (new maze every episode)
- **Risk of camping** near base without exploration signal
## 🏗️ Architecture
### Three-Phase Training Pipeline
| Phase | Description | Opponents | Key Technique |
|---|---|---|---|
| **1** | MaskablePPO baseline | Random valid actions | Invalid action masking |
| **2** | Adaptive exploration | Random + visit-count bonus | Annealing: `α = 1 - tanh(k·deaths)` |
| **3** | Curriculum self-play | Rule-based (static → smart) | Elo-style difficulty progression |
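The Phase 2 annealing schedule can be sketched as follows. The function names and the count-based bonus shape are illustrative, not the training script's actual API; only the `α = 1 - tanh(k·deaths)` schedule comes from the table above:

```python
import math

def explore_scale(deaths: int, k: float = 0.1) -> float:
    """Anneal the exploration bonus: alpha = 1 - tanh(k * deaths).

    Starts at 1.0 for a fresh agent and decays toward 0.0 as the
    cumulative death count grows, so the bonus fades out as the agent
    learns to survive.
    """
    return 1.0 - math.tanh(k * deaths)

def visit_count_bonus(visits: int, deaths: int, base: float = 0.1) -> float:
    """Scaled count-based bonus for stepping onto a tile seen `visits` times before."""
    return explore_scale(deaths) * base / math.sqrt(visits + 1)
```

Because the scale multiplies the whole bonus, a mature agent receives essentially no exploration reward, which removes the incentive to farm bonus by wandering instead of fighting.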
### Design Decisions (Literature-Backed)
1. **MaskablePPO** (`sb3-contrib`): Handles invalid actions by setting logits to `-∞` before softmax. Proven superior to action penalties (Huang & Ontañón, 2020).
2. **MAPPO-style hyperparameters**: Value normalization, centralized value / decentralized policy, low sample reuse (Yu et al., NeurIPS 2022).
3. **Adaptive exploration annealing**: Directly from Pommerman SOTA (2024). As agent skill improves, exploration bonus decreases automatically, preventing camping.
4. **Curriculum learning**: 4 stages: static → simple → smart → mixed opponents. Advance at 55% win rate (or 500 episodes max).
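The masking trick in point 1 reduces to a few lines. This is a standalone NumPy sketch of the idea; MaskablePPO applies the same masking inside its categorical action distribution rather than exposing a helper like this:

```python
import numpy as np

def masked_softmax(logits: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Set logits of invalid actions to -inf before the softmax, so their
    probability (and hence their policy gradient) is exactly zero."""
    masked = np.where(mask, logits, -np.inf)
    masked = masked - masked.max()  # shift for numerical stability
    exp = np.exp(masked)
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])
mask = np.array([True, False, True, True])  # action 1 is invalid this step
probs = masked_softmax(logits, mask)
```

Compared with penalizing invalid actions through the reward, masking keeps the policy from ever sampling them, so no training signal is wasted on learning which actions are illegal.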
### Key Papers
- **Pommerman multi-agent RL**: arXiv:2407.00662 (98.85% win-rate recipe)
- **MAPPO best practices**: arXiv:2103.01955 (NeurIPS 2022)
- **Invalid action masking**: arXiv:2006.14171 (theoretically justified)
- **RND exploration** (fallback): arXiv:1810.12894 (if Phase 2 still camps)
## 🚀 Running Training
### Prerequisites
```bash
# Download the environment (auto-bootstrapped in script)
python -c "from huggingface_hub import snapshot_download; snapshot_download('e-rong/til-26-ae', repo_type='space', local_dir='./til-26-ae-repo')"
```
### Local Training
```bash
export TOTAL_TIMESTEPS="500_000:500_000:1_000_000"
export HUB_MODEL_ID="E-Rong/til-26-ae-agent"
export TRACKIO_PROJECT="til-26-ae"
python train_all_phases.py
```
### HF Jobs (Recommended)
```bash
# Requires HF credits; run from a Space with the script uploaded
# Hardware: cpu-upgrade or a10g-large for GPU acceleration
```
## 📊 Monitoring
Trackio dashboard: `E-Rong/til-26-ae-trackio`
Logged metrics per phase:
- `train/mean_episode_reward`
- `train/mean_episode_length`
- `train/mean_explore_bonus` (Phase 2)
- `train/curriculum_stage` (Phase 3)
Alerts trigger on:
- Low reward (< -5) after 50k steps (suggests camping)
- Curriculum stage advancement
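The two alert conditions reduce to a simple check over the logged metrics. This is a sketch of the rules as stated above; the actual trigger lives in the training script:

```python
def check_alerts(step: int, mean_reward: float,
                 stage: int, prev_stage: int) -> list[str]:
    """Evaluate the two alert rules: camping suspicion and stage advancement."""
    alerts = []
    # Rule 1: persistently low reward after the warm-up period hints at camping.
    if step >= 50_000 and mean_reward < -5:
        alerts.append("low reward after 50k steps: possible camping")
    # Rule 2: the curriculum moved to a harder opponent stage.
    if stage > prev_stage:
        alerts.append(f"curriculum advanced to stage {stage}")
    return alerts
```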
## 📁 Repository Structure
```
train_all_phases.py # Full 3-phase pipeline
requirements.txt # Dependencies
bomberman_phase1_final.zip # Saved after Phase 1
bomberman_phase2_final.zip # Saved after Phase 2
bomberman_phase3_final.zip # Saved after Phase 3
```
## 🧪 Evaluation
To evaluate a trained agent against random opponents:
```python
from train_all_phases import BombermanSingleAgentEnv
from sb3_contrib import MaskablePPO
from til_environment.config import default_config
cfg = default_config()
env = BombermanSingleAgentEnv(cfg=cfg)
model = MaskablePPO.load("bomberman_phase3_final")
obs, _ = env.reset(seed=42)
for _ in range(200):
    # Mask invalid actions at inference time, just as during training
    action, _ = model.predict(obs, action_masks=env.action_masks())
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break
env.close()
```
## 📄 License
MIT, based on the TIL-26 AE challenge environment.
<!-- ml-intern-provenance -->
## Generated by ML Intern
This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern
## Usage
This repository stores Stable-Baselines3 checkpoints, not a `transformers` model, so load it with `sb3-contrib`. Download the trained policy from the Hub (the filename matches the checkpoints listed above):
```python
from huggingface_hub import hf_hub_download
from sb3_contrib import MaskablePPO

checkpoint = hf_hub_download("E-Rong/til-26-ae-agent", "bomberman_phase3_final.zip")
model = MaskablePPO.load(checkpoint)
```