---
tags:
- ml-intern
---
# TIL-26-AE Bomberman Agent: MaskablePPO + Curriculum Learning

This repository contains the training pipeline for an RL agent competing in the
**TIL-26 Automated Exploration** (AE) challenge, a competitive multi-agent
Bomberman-like environment.

## 🎯 Challenge

[Environment](https://huggingface.co/spaces/e-rong/til-26-ae): 2–6 team competitive Bomberman on a procedurally generated 16×16 maze. Key challenges:
- **Partial observability** (directional viewcones, not the full map)
- **Sparse terminal rewards** (±50 for base destruction/survival)
- **Procedural generation** (a new maze every episode)
- **Risk of camping** near the base without an exploration signal

## πŸ—οΈ Architecture

### Three-Phase Training Pipeline

| Phase | Description | Opponents | Key Technique |
|---|---|---|---|
| **1** | MaskablePPO baseline | Random valid actions | Invalid action masking |
| **2** | Adaptive exploration | Random + visit-count bonus | Annealing: `α = 1 - tanh(k·deaths)` (sketched below) |
| **3** | Curriculum self-play | Rule-based (static → smart) | Elo-style difficulty progression |
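
A minimal sketch of the Phase 2 annealing coefficient, assuming `deaths` is the agent's cumulative death count and `k` a tunable decay rate (both names are illustrative, not taken from `train_all_phases.py`):

```python
import numpy as np

def exploration_weight(deaths: int, k: float = 0.01) -> float:
    """Anneal the exploration bonus: alpha = 1 - tanh(k * deaths).

    alpha starts at 1.0 for a fresh agent and decays smoothly toward 0.0
    as deaths accumulate over training, so the visit-count bonus fades
    once the agent has gathered experience.
    """
    return 1.0 - np.tanh(k * deaths)

# The shaped reward would scale the visit-count bonus by this weight:
# reward = env_reward + exploration_weight(total_deaths) * visit_bonus
```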

### Design Decisions (Literature-Backed)

1. **MaskablePPO** (`sb3-contrib`): Handles invalid actions by setting their logits to `-∞` before the softmax, so they receive zero probability. Shown to outperform action penalties (Huang & Ontañón, 2020); see the masking sketch after this list.
2. **MAPPO-style hyperparameters**: Value normalization, centralized value / decentralized policy, low sample reuse (Yu et al., NeurIPS 2022).
3. **Adaptive exploration annealing**: Adapted from recent Pommerman work (arXiv:2407.00662). As agent skill improves, the exploration bonus decreases automatically, preventing camping.
4. **Curriculum learning**: 4 stages (static → simple → smart → mixed opponents). Advance at a 55% win rate, or after 500 episodes at most.
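
A minimal sketch of the masking step from point 1, assuming a NumPy logits vector and a boolean validity mask (the actual implementation lives inside `sb3-contrib`'s maskable distributions):

```python
import numpy as np

def masked_softmax(logits: np.ndarray, valid: np.ndarray) -> np.ndarray:
    """Send invalid-action logits to -inf so their probability is exactly zero."""
    masked = np.where(valid, logits, -np.inf)
    shifted = np.exp(masked - masked[valid].max())  # numerically stable softmax
    return shifted / shifted.sum()

probs = masked_softmax(np.array([1.0, 2.0, 0.5]), np.array([True, False, True]))
# probs[1] == 0.0: the invalid action can never be sampled, so no reward
# penalty is needed to teach the agent to avoid it.
```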

### Key Papers

- **Pommerman multi-agent RL**: arXiv:2407.00662 (98.85% win-rate recipe)
- **MAPPO best practices**: arXiv:2103.01955 (Yu et al., NeurIPS 2022)
- **Invalid action masking**: arXiv:2006.14171 (theoretically justified)
- **RND exploration** (fallback): arXiv:1810.12894, in case Phase 2 still camps

## 🚀 Running Training

### Prerequisites
```bash
# Download the environment (auto-bootstrapped in script)
python -c "from huggingface_hub import snapshot_download; snapshot_download('e-rong/til-26-ae', repo_type='space', local_dir='./til-26-ae-repo')"
```

### Local Training
```bash
export TOTAL_TIMESTEPS="500_000:500_000:1_000_000"
export HUB_MODEL_ID="E-Rong/til-26-ae-agent"
export TRACKIO_PROJECT="til-26-ae"
python train_all_phases.py
```
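
`TOTAL_TIMESTEPS` packs one timestep budget per phase into a colon-separated string. A hypothetical parse, assuming `train_all_phases.py` splits it in phase order (variable names here are illustrative):

```python
import os

# One entry per phase: Phase 1, Phase 2, Phase 3.
raw = os.environ.get("TOTAL_TIMESTEPS", "500_000:500_000:1_000_000")
phase_budgets = [int(part) for part in raw.split(":")]
print(phase_budgets)  # [500000, 500000, 1000000]
```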

### HF Jobs (Recommended)
```bash
# Requires HF credits; run from a Space with the script uploaded
# Hardware: cpu-upgrade (CPU-only) or a10g-large (GPU acceleration)
```

## 📊 Monitoring

Trackio dashboard: `E-Rong/til-26-ae-trackio`

Logged metrics per phase:
- `train/mean_episode_reward`
- `train/mean_episode_length`
- `train/mean_explore_bonus` (Phase 2)
- `train/curriculum_stage` (Phase 3)

Alerts trigger on:
- Low reward (< -5) after 50k steps, suggesting camping
- Curriculum stage advancement
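
A hypothetical version of the camping check behind the first alert, assuming `step` and `mean_reward` are read from the Trackio metrics above (the function name is illustrative):

```python
def camping_alert(step: int, mean_reward: float) -> bool:
    """Fire once the agent has had 50k steps to learn but reward stays below -5."""
    return step >= 50_000 and mean_reward < -5.0
```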

## πŸ“ Repository Structure

```
train_all_phases.py   # Full 3-phase pipeline
requirements.txt      # Dependencies
bomberman_phase1_final.zip   # Saved after Phase 1
bomberman_phase2_final.zip   # Saved after Phase 2
bomberman_phase3_final.zip   # Saved after Phase 3
```

## 🧪 Evaluation

To evaluate a trained agent against random opponents:
```python
from train_all_phases import BombermanSingleAgentEnv
from sb3_contrib import MaskablePPO
from til_environment.config import default_config

cfg = default_config()
env = BombermanSingleAgentEnv(cfg=cfg)
model = MaskablePPO.load("bomberman_phase3_final")

obs, _ = env.reset(seed=42)
for _ in range(200):
    # Re-query the valid-action mask each step; deterministic=True for greedy evaluation.
    action, _ = model.predict(obs, action_masks=env.action_masks(), deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break
env.close()
```

## 📜 License

MIT, based on the TIL-26 AE challenge environment.

<!-- ml-intern-provenance -->
## Generated by ML Intern

This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern

## Usage

The checkpoints in this repository are `sb3-contrib` MaskablePPO zip archives, not `transformers` models, so load them with Stable-Baselines3 rather than `AutoModelForCausalLM`:

```python
from huggingface_hub import hf_hub_download
from sb3_contrib import MaskablePPO

# Download a phase checkpoint from the Hub and load it.
path = hf_hub_download("E-Rong/til-26-ae-agent", "bomberman_phase3_final.zip")
model = MaskablePPO.load(path)
```

See the Evaluation section above for running the loaded agent in the environment.