TIL-26-AE Bomberman Agent: MaskablePPO + Curriculum Learning

This repository contains the training pipeline for an RL agent competing in the TIL-26 Automated Exploration (AE) challenge, a competitive multi-agent Bomberman-like environment.

🎯 Challenge

Environment: 2–6 team competitive Bomberman on a procedurally generated 16×16 maze. Key challenges:

  • Partial observability (directional viewcones, not full map)
  • Sparse terminal rewards (±50 for base destruction/survival)
  • Procedural generation (new maze every episode)
  • Risk of camping near base without exploration signal

🏗️ Architecture

Three-Phase Training Pipeline

| Phase | Description | Opponents | Key Technique |
|-------|-------------|-----------|---------------|
| 1 | MaskablePPO baseline | Random valid actions | Invalid action masking |
| 2 | Adaptive exploration | Random + visit-count bonus | Annealing: α = 1 − tanh(k·deaths) |
| 3 | Curriculum self-play | Rule-based (static → smart) | Elo-style difficulty progression |
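
The Phase 2 annealing rule from the table can be sketched as below. This is a minimal illustration only: the count-based bonus shape (`alpha / sqrt(visits)`), the value of `k`, the per-episode reset, and all function and class names are assumptions, not taken from `train_all_phases.py`. `deaths` stands for whatever death counter the training script tracks.

```python
import math
from collections import defaultdict


def annealing_coefficient(deaths: int, k: float = 0.1) -> float:
    """alpha = 1 - tanh(k * deaths), as in the table; k=0.1 is an assumed value."""
    return 1.0 - math.tanh(k * deaths)


class VisitCountBonus:
    """Hypothetical count-based exploration bonus scaled by alpha."""

    def __init__(self, k: float = 0.1):
        self.k = k
        self.visits = defaultdict(int)

    def reset(self) -> None:
        # Assumption: counts are per episode, since each episode has a new maze.
        self.visits.clear()

    def bonus(self, cell: tuple, deaths: int) -> float:
        # Count this visit, then pay a bonus that shrinks with repeat visits.
        self.visits[cell] += 1
        alpha = annealing_coefficient(deaths, self.k)
        return alpha / math.sqrt(self.visits[cell])
```

With `deaths = 0` the coefficient is 1, so the first visit to a cell pays the full bonus; as `k·deaths` grows, `tanh` saturates and the bonus tends to zero.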

Design Decisions (Literature-Backed)

  1. MaskablePPO (sb3-contrib): Handles invalid actions by setting logits to -∞ before softmax. Proven superior to action penalties (Huang & Ontañón, 2020).
  2. MAPPO-style hyperparameters: Value normalization, centralized value / decentralized policy, low sample reuse (Yu et al., NeurIPS 2022).
  3. Adaptive exploration annealing: Taken from the Pommerman recipe (arxiv:2407.00662, 2024). As the agent's skill improves, the exploration bonus decreases automatically, which is intended to prevent camping.
  4. Curriculum learning: 4 stages (static → simple → smart → mixed opponents). Advance at 55% win rate (or 500 episodes max).
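
The masking idea in item 1 can be shown with a small standalone softmax. This is a sketch of the technique in plain Python, not the sb3-contrib implementation (MaskablePPO applies the mask internally); the function name `masked_softmax` is hypothetical.

```python
import math


def masked_softmax(logits, mask):
    """Softmax over valid actions only; invalid actions get probability 0."""
    if not any(mask):
        raise ValueError("at least one action must be valid")
    # Replace invalid logits with -inf so exp() maps them to exactly 0.
    masked = [x if ok else -math.inf for x, ok in zip(logits, mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Action 1 is invalid, so all probability mass goes to actions 0 and 2.
probs = masked_softmax([1.0, 2.0, 3.0], [True, False, True])
```

Because the invalid action has zero probability, it is never sampled and contributes no gradient, in contrast to a penalty scheme where the agent must first take the invalid action to learn to avoid it.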

Key Papers

  • Pommerman multi-agent RL: arxiv:2407.00662 (98.85% win rate recipe)
  • MAPPO best practices: arxiv:2103.01955 (NeurIPS 2022)
  • Invalid Action Masking: arxiv:2006.14171 (theoretically justified)
  • RND exploration (fallback): arxiv:1810.12894 (if Phase 2 still camps)

🚀 Running Training

Prerequisites

# Download the environment (auto-bootstrapped in script)
python -c "from huggingface_hub import snapshot_download; snapshot_download('e-rong/til-26-ae', repo_type='space', local_dir='./til-26-ae-repo')"

Local Training

export TOTAL_TIMESTEPS="500_000:500_000:1_000_000"
export HUB_MODEL_ID="E-Rong/til-26-ae-agent"
export TRACKIO_PROJECT="til-26-ae"
python train_all_phases.py
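
The three colon-separated fields in `TOTAL_TIMESTEPS` suggest one step budget per phase, in order. The script's actual parsing is not shown here, so the following is only a sketch of how such a value could be read; `parse_phase_timesteps` is a hypothetical helper.

```python
import os


def parse_phase_timesteps(raw: str, n_phases: int = 3) -> list:
    """Split 'a:b:c' into per-phase step budgets."""
    # int() accepts underscore digit separators, so "500_000" parses to 500000.
    parts = [int(p) for p in raw.split(":")]
    if len(parts) != n_phases:
        raise ValueError(
            f"expected {n_phases} colon-separated values, got {len(parts)}"
        )
    return parts


# Fall back to the value from the export line above if the variable is unset.
raw = os.environ.get("TOTAL_TIMESTEPS", "500_000:500_000:1_000_000")
budgets = parse_phase_timesteps(raw)
```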

HF Jobs (Recommended)

# Requires HF credits; run from a Space with the script uploaded
# Hardware: cpu-upgrade or a10g-large for GPU acceleration

📊 Monitoring

Trackio dashboard: E-Rong/til-26-ae-trackio

Logged metrics per phase:

  • train/mean_episode_reward
  • train/mean_episode_length
  • train/mean_explore_bonus (Phase 2)
  • train/curriculum_stage (Phase 3)

Alerts trigger on:

  • Low reward (< -5) after 50k steps → suggests camping
  • Curriculum stage advancement
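
The two alert conditions can be expressed as simple checks. The sketch below combines the low-reward threshold with the curriculum rule stated under Design Decisions (advance at 55% win rate or after 500 episodes). The class and function names, the `min_episodes` warm-up, and the choice to measure win rate over the current stage are assumptions; the real logic lives in `train_all_phases.py`.

```python
from dataclasses import dataclass

STAGES = ("static", "simple", "smart", "mixed")  # stage order from the design notes


@dataclass
class CurriculumTracker:
    win_threshold: float = 0.55  # advance at 55% win rate
    max_episodes: int = 500      # or after 500 episodes in a stage
    min_episodes: int = 50       # assumed warm-up before the win rate is trusted
    stage: int = 0
    episodes: int = 0
    wins: int = 0

    def record(self, won: bool) -> bool:
        """Record one episode result; return True if the stage advanced."""
        self.episodes += 1
        self.wins += int(won)
        if self.stage >= len(STAGES) - 1:
            return False  # already at the final (mixed) stage
        win_rate = self.wins / self.episodes
        ready = self.episodes >= self.min_episodes and win_rate >= self.win_threshold
        if ready or self.episodes >= self.max_episodes:
            # Move to the next stage and restart the per-stage counters.
            self.stage += 1
            self.episodes = 0
            self.wins = 0
            return True
        return False


def low_reward_alert(mean_reward: float, timesteps: int) -> bool:
    """True when mean episode reward is below -5 at or after 50k steps."""
    return timesteps >= 50_000 and mean_reward < -5.0
```

A `True` return from `record` is the point at which a stage-advancement alert would fire and `train/curriculum_stage` would change.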

📁 Repository Structure

train_all_phases.py   # Full 3-phase pipeline
requirements.txt      # Dependencies
bomberman_phase1_final.zip   # Saved after Phase 1
bomberman_phase2_final.zip   # Saved after Phase 2
bomberman_phase3_final.zip   # Saved after Phase 3

🧪 Evaluation

To evaluate a trained agent against random opponents:

from train_all_phases import BombermanSingleAgentEnv
from sb3_contrib import MaskablePPO
from til_environment.config import default_config

cfg = default_config()
env = BombermanSingleAgentEnv(cfg=cfg)
model = MaskablePPO.load("bomberman_phase3_final")  # reads bomberman_phase3_final.zip

obs, _ = env.reset(seed=42)
for _ in range(200):
    # Pass the current mask so the policy only chooses valid actions
    action, _ = model.predict(obs, action_masks=env.action_masks())
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break
env.close()
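
A single episode gives a noisy picture, so the loop above can be wrapped to average over several seeded episodes. This is a sketch under the assumption that the environment follows the same `reset` / `step` / `action_masks` interface used above; the `evaluate` helper is hypothetical and not part of the repository.

```python
def evaluate(model, env, n_episodes: int = 10, max_steps: int = 200, seed: int = 0) -> float:
    """Return the mean undiscounted episode return over n_episodes."""
    if n_episodes < 1:
        raise ValueError("n_episodes must be at least 1")
    returns = []
    for ep in range(n_episodes):
        # Use a different seed per episode so each run sees a different maze.
        obs, _ = env.reset(seed=seed + ep)
        total = 0.0
        for _ in range(max_steps):
            action, _ = model.predict(
                obs, action_masks=env.action_masks(), deterministic=True
            )
            obs, reward, terminated, truncated, _ = env.step(action)
            total += float(reward)
            if terminated or truncated:
                break
        returns.append(total)
    return sum(returns) / len(returns)
```

Calling `evaluate(model, env, n_episodes=20)` with the `model` and `env` objects created above would return the average return over 20 mazes.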

📜 License

MIT. Based on the TIL-26 AE challenge environment.