---
tags:
  - ml-intern
---

# TIL-26-AE Bomberman Agent: MaskablePPO + Curriculum Learning

This repository contains the training pipeline for an RL agent competing in the TIL-26 Automated Exploration (AE) challenge, a competitive multi-agent Bomberman-like environment.

## 🎯 Challenge

**Environment:** 2–6 team competitive Bomberman on a procedurally generated 16×16 maze. Key challenges:

- Partial observability (directional viewcones, not the full map)
- Sparse terminal rewards (±50 for base destruction/survival)
- Procedural generation (a new maze every episode)
- Risk of camping near the base without an exploration signal

πŸ—οΈ Architecture

### Three-Phase Training Pipeline

| Phase | Description | Opponents | Key Technique |
|-------|-------------|-----------|---------------|
| 1 | MaskablePPO baseline | Random valid actions | Invalid action masking |
| 2 | Adaptive exploration | Random + visit-count bonus | Annealing: α = 1 − tanh(k·deaths) |
| 3 | Curriculum self-play | Rule-based (static → smart) | Elo-style difficulty progression |
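
The Phase 2 annealing factor from the table above is one line of code. A minimal sketch; the coefficient `k` and the function name are illustrative, not taken from `train_all_phases.py`:

```python
import numpy as np

def explore_bonus_scale(deaths: int, k: float = 0.05) -> float:
    """Annealing factor alpha = 1 - tanh(k * deaths).

    Close to 1.0 early in training (few deaths) and decaying toward
    0.0 as deaths accumulate, so the visit-count bonus fades out.
    """
    return float(1.0 - np.tanh(k * deaths))

# Example: scale a visit-count exploration bonus by alpha.
# shaped_reward = env_reward + explore_bonus_scale(deaths) * visit_bonus
```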

### Design Decisions (Literature-Backed)

1. **MaskablePPO (sb3-contrib):** handles invalid actions by setting their logits to −∞ before the softmax, proven superior to action penalties (Huang & Ontañón, 2020); a minimal sketch follows this list.
2. **MAPPO-style hyperparameters:** value normalization, centralized value / decentralized policy, low sample reuse (Yu et al., NeurIPS 2022).
3. **Adaptive exploration annealing:** taken directly from Pommerman SOTA (2024). As agent skill improves, the exploration bonus decreases automatically, preventing camping (see the annealing sketch above).
4. **Curriculum learning:** 4 stages, static → simple → smart → mixed opponents. Advance at a 55% win rate (or after 500 episodes at most); see the gate sketch after the paper list below.
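
Design decision 1 boils down to replacing invalid-action logits with a very negative number before the softmax. A minimal sketch of the idea, independent of how sb3-contrib implements it internally:

```python
import torch

def mask_logits(logits: torch.Tensor, action_mask: torch.Tensor) -> torch.Tensor:
    """Replace logits of invalid actions with a large negative value.

    logits:      (batch, n_actions) raw policy outputs
    action_mask: (batch, n_actions) bool, True where an action is valid
    """
    return torch.where(action_mask, logits, torch.full_like(logits, -1e8))

# After the softmax, invalid actions get ~0 probability, so they are
# never sampled and contribute no gradient (Huang & Ontañón, 2020):
# probs = torch.softmax(mask_logits(logits, mask), dim=-1)
```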

### Key Papers

- Pommerman multi-agent RL: arXiv:2407.00662 (the 98.85% win-rate recipe)
- MAPPO best practices: arXiv:2103.01955 (NeurIPS 2022)
- Invalid action masking: arXiv:2006.14171 (theoretically justified)
- RND exploration (fallback): arXiv:1810.12894 (if Phase 2 still camps)
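
For concreteness, the advancement gate from design decision 4 is just a threshold check. A sketch with illustrative names, not the script's actual function:

```python
def should_advance(win_rate: float, episodes_in_stage: int,
                   threshold: float = 0.55, max_episodes: int = 500) -> bool:
    """Advance to the next curriculum stage once the agent clears a
    55% win rate, or unconditionally after 500 episodes so training
    never stalls on one opponent pool."""
    return win_rate >= threshold or episodes_in_stage >= max_episodes
```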

## 🚀 Running Training

### Prerequisites

```bash
# Download the environment (auto-bootstrapped in the training script)
python -c "from huggingface_hub import snapshot_download; snapshot_download('e-rong/til-26-ae', repo_type='space', local_dir='./til-26-ae-repo')"
```

### Local Training

```bash
export TOTAL_TIMESTEPS="500_000:500_000:1_000_000"
export HUB_MODEL_ID="E-Rong/til-26-ae-agent"
export TRACKIO_PROJECT="til-26-ae"
python train_all_phases.py
```
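
`TOTAL_TIMESTEPS` carries one budget per phase, colon-separated. Presumably the script splits it along these lines (a sketch, not the actual parsing code in `train_all_phases.py`):

```python
import os

# "500_000:500_000:1_000_000" -> [500000, 500000, 1000000]
# int() accepts underscore separators since Python 3.6.
phase_budgets = [int(s) for s in os.environ["TOTAL_TIMESTEPS"].split(":")]
```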

### HF Jobs (Recommended)

```bash
# Requires HF credits; run from a Space with the script uploaded.
# Hardware: cpu-upgrade, or a10g-large for GPU acceleration.
```

## 📊 Monitoring

Trackio dashboard: E-Rong/til-26-ae-trackio

Logged metrics per phase:

- `train/mean_episode_reward`
- `train/mean_episode_length`
- `train/mean_explore_bonus` (Phase 2)
- `train/curriculum_stage` (Phase 3)

Alerts trigger on:

- Low reward (< −5) after 50k steps → suggests camping
- Curriculum stage advancement
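
Trackio exposes a wandb-style `init`/`log`/`finish` API. A minimal sketch of how the per-phase metrics and the camping alert could be wired up; the variable values and the threshold check are illustrative:

```python
import trackio

trackio.init(project="til-26-ae")

# Example values as they might arrive from a training callback.
global_step, mean_reward, mean_length = 60_000, -6.2, 180.0

trackio.log({
    "train/mean_episode_reward": mean_reward,
    "train/mean_episode_length": mean_length,
})

# Camping alert: persistently low reward after 50k steps.
if global_step > 50_000 and mean_reward < -5:
    print("ALERT: mean reward < -5 after 50k steps; possible camping")

trackio.finish()
```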

πŸ“ Repository Structure

```
train_all_phases.py          # Full 3-phase pipeline
requirements.txt             # Dependencies
bomberman_phase1_final.zip   # Saved after Phase 1
bomberman_phase2_final.zip   # Saved after Phase 2
bomberman_phase3_final.zip   # Saved after Phase 3
```

## 🧪 Evaluation

To evaluate a trained agent against random opponents:

```python
from train_all_phases import BombermanSingleAgentEnv
from sb3_contrib import MaskablePPO
from til_environment.config import default_config

cfg = default_config()
env = BombermanSingleAgentEnv(cfg=cfg)
model = MaskablePPO.load("bomberman_phase3_final")

obs, _ = env.reset(seed=42)
for _ in range(200):
    # Mask out invalid actions at prediction time, mirroring training.
    action, _ = model.predict(obs, action_masks=env.action_masks())
    obs, reward, done, truncated, info = env.step(action)
    if done or truncated:
        break
env.close()
```
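
By default `model.predict` samples from the masked action distribution; pass `deterministic=True` for a greedy evaluation policy.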

## 📜 License

MIT, based on the TIL-26 AE challenge environment.

## Generated by ML Intern

This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.

### Usage

This repository stores Stable-Baselines3 MaskablePPO checkpoints (`.zip` files), not a transformers model, so load it through `huggingface_hub` rather than `AutoModel`:

```python
from huggingface_hub import hf_hub_download
from sb3_contrib import MaskablePPO

# Download a phase checkpoint from the Hub and load it.
path = hf_hub_download(repo_id="E-Rong/til-26-ae-agent",
                       filename="bomberman_phase3_final.zip")
model = MaskablePPO.load(path)
```