---
tags:
- ml-intern
---
# TIL-26-AE Bomberman Agent: MaskablePPO + Curriculum Learning
This repository contains the training pipeline for an RL agent competing in the
**TIL-26 Automated Exploration** (AE) challenge, a competitive multi-agent
Bomberman-like environment.
## 🎯 Challenge
[Environment](https://huggingface.co/spaces/e-rong/til-26-ae): 2–6 team competitive Bomberman on a procedurally generated 16×16 maze. Key challenges:
- **Partial observability** (directional viewcones, not full map)
- **Sparse terminal rewards** (±50 for base destruction/survival)
- **Procedural generation** (new maze every episode)
- **Risk of camping** near the base without an exploration signal
## 🏗️ Architecture
### Three-Phase Training Pipeline
| Phase | Description | Opponents | Key Technique |
|---|---|---|---|
| **1** | MaskablePPO baseline | Random valid actions | Invalid action masking |
| **2** | Adaptive exploration | Random + visit-count bonus | Annealing: `α = 1 − tanh(k·deaths)` (sketched below) |
| **3** | Curriculum self-play | Rule-based (static → smart) | Elo-style difficulty progression |
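
The Phase 2 annealing rule reads: scale the exploration bonus by `α = 1 − tanh(k·deaths)`, so the bonus starts at full strength and fades as the agent accumulates experience. A minimal sketch of this idea; the `1/√n` visit-count bonus shape and the decay constant `k` are illustrative assumptions, not the pipeline's exact values:

```python
import math
from collections import Counter

visits: Counter = Counter()  # per-cell visit counts (illustrative; reset each episode)

def explore_bonus(cell: tuple[int, int], deaths: int, k: float = 0.01) -> float:
    """Annealed visit-count bonus: alpha = 1 - tanh(k * deaths) decays
    from 1 toward 0 as deaths (a proxy for experience) accumulate."""
    alpha = 1.0 - math.tanh(k * deaths)
    visits[cell] += 1
    return alpha / math.sqrt(visits[cell])
```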
### Design Decisions (Literature-Backed)
1. **MaskablePPO** (`sb3-contrib`): Handles invalid actions by setting their logits to `-∞` before the softmax (sketched after this list). Proven superior to action penalties (Huang & Ontañón, 2020).
2. **MAPPO-style hyperparameters**: Value normalization, centralized value / decentralized policy, low sample reuse (Yu et al., NeurIPS 2022).
3. **Adaptive exploration annealing**: Adopted from the 2024 Pommerman state of the art. As the agent's skill improves, the exploration bonus decays automatically, discouraging camping.
4. **Curriculum learning**: 4 stages (static → simple → smart → mixed opponents). Advance at a 55% win rate, or after at most 500 episodes (one plausible gate is sketched after the paper list below).
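
A minimal sketch of the masking idea from item 1; this mirrors what `MaskablePPO` does internally but is not the library's actual code:

```python
import torch

def masked_softmax(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Push invalid-action logits to -inf before the softmax, so masked
    actions receive exactly zero probability (and zero gradient).

    `mask` is a boolean tensor, True for valid actions."""
    masked_logits = logits.masked_fill(~mask, float("-inf"))
    return torch.softmax(masked_logits, dim=-1)

# Example: action 2 is invalid, so it gets probability 0.
probs = masked_softmax(torch.tensor([1.0, 0.5, 2.0]),
                       torch.tensor([True, True, False]))
```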
### Key Papers
- **Pommerman multi-agent RL**: arXiv:2407.00662 (the 98.85% win-rate recipe)
- **MAPPO best practices**: arXiv:2103.01955 (Yu et al., NeurIPS 2022)
- **Invalid action masking**: arXiv:2006.14171 (theoretically justified)
- **RND exploration** (fallback): arXiv:1810.12894 (if Phase 2 still camps)
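
One plausible reading of the Phase 3 gate from the design decisions (advance at a 55% win rate, or unconditionally after 500 episodes), as a sketch; the stage names come from the four-stage list above:

```python
STAGES = ["static", "simple", "smart", "mixed"]

def should_advance(wins: int, episodes: int,
                   gate: float = 0.55, cap: int = 500) -> bool:
    """Advance when the rolling win rate clears the gate, or after the
    episode cap regardless (one plausible reading of the rule above)."""
    if episodes >= cap:
        return True
    return episodes > 0 and wins / episodes >= gate
```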
## 🚀 Running Training
### Prerequisites
```bash
# Download the environment (auto-bootstrapped in script)
python -c "from huggingface_hub import snapshot_download; snapshot_download('e-rong/til-26-ae', repo_type='space', local_dir='./til-26-ae-repo')"
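# Install dependencies (requirements.txt ships with this repo)
pip install -r requirements.txt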
```
### Local Training
```bash
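# Colon-separated step budgets, presumably one per training phase (1:2:3)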
export TOTAL_TIMESTEPS="500_000:500_000:1_000_000"
export HUB_MODEL_ID="E-Rong/til-26-ae-agent"
export TRACKIO_PROJECT="til-26-ae"
python train_all_phases.py
```
### HF Jobs (Recommended)
```bash
# Requires HF credits; run from a Space with the script uploaded.
# Hardware: cpu-upgrade, or a10g-large for GPU acceleration.
```
## 📊 Monitoring
Trackio dashboard: `E-Rong/til-26-ae-trackio`
Logged metrics per phase:
- `train/mean_episode_reward`
- `train/mean_episode_length`
- `train/mean_explore_bonus` (Phase 2)
- `train/curriculum_stage` (Phase 3)
Alerts trigger on:
- Low reward (< -5) after 50k steps → suggests camping
- Curriculum stage advancement
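
Trackio mirrors the wandb logging API, so wiring these metrics up is a few calls. A minimal sketch, assuming Trackio's wandb-style `init`/`log`/`finish`; the metric values are placeholders:

```python
import trackio

trackio.init(project="til-26-ae")  # matches TRACKIO_PROJECT above
trackio.log({
    "train/mean_episode_reward": -3.2,  # placeholder value
    "train/mean_episode_length": 87,    # placeholder value
    "train/mean_explore_bonus": 0.41,   # Phase 2 only
})
trackio.finish()
```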
## 📁 Repository Structure
```
train_all_phases.py # Full 3-phase pipeline
requirements.txt # Dependencies
bomberman_phase1_final.zip # Saved after Phase 1
bomberman_phase2_final.zip # Saved after Phase 2
bomberman_phase3_final.zip # Saved after Phase 3
```
## 🧪 Evaluation
To evaluate a trained agent against random opponents:
```python
from train_all_phases import BombermanSingleAgentEnv
from sb3_contrib import MaskablePPO
from til_environment.config import default_config
cfg = default_config()
env = BombermanSingleAgentEnv(cfg=cfg)
model = MaskablePPO.load("bomberman_phase3_final")
obs, _ = env.reset(seed=42)
for _ in range(200):
    action, _ = model.predict(obs, action_masks=env.action_masks())
    obs, reward, done, truncated, info = env.step(action)
    if done or truncated:
        break
env.close()
```
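
To turn a single rollout into a win-rate estimate (the statistic the Phase 3 curriculum gates on), average over seeded episodes. A sketch continuing from the imports above; treating a positive terminal reward as a win (+50 on a base destroy/survival, per the reward structure) is an assumption about the environment's API:

```python
env = BombermanSingleAgentEnv(cfg=cfg)
model = MaskablePPO.load("bomberman_phase3_final")

wins, n_episodes = 0, 100
for ep in range(n_episodes):
    obs, _ = env.reset(seed=ep)
    done = truncated = False
    while not (done or truncated):
        action, _ = model.predict(obs, action_masks=env.action_masks())
        obs, reward, done, truncated, info = env.step(action)
    wins += int(reward > 0)  # assumption: positive terminal reward == win
env.close()
print(f"Win rate vs. random opponents: {wins / n_episodes:.1%}")
```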
## 📜 License
MIT; based on the TIL-26 AE challenge environment.
<!-- ml-intern-provenance -->
## Generated by ML Intern
This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern
## Usage

The published checkpoints are Stable-Baselines3 (`MaskablePPO`) zip archives, not `transformers` models, so load them with `sb3-contrib`:

```python
from huggingface_hub import hf_hub_download
from sb3_contrib import MaskablePPO

# Download a phase checkpoint from the Hub and load it.
checkpoint = hf_hub_download(
    repo_id="E-Rong/til-26-ae-agent",
    filename="bomberman_phase3_final.zip",
)
model = MaskablePPO.load(checkpoint)
```