---
tags:
- ml-intern
---

# TIL-26-AE Bomberman Agent — MaskablePPO + Curriculum Learning

This repository contains the training pipeline for an RL agent competing in the **TIL-26 Automated Exploration** (AE) challenge — a competitive multi-agent Bomberman-like environment.

## 🎯 Challenge

[Environment](https://huggingface.co/spaces/e-rong/til-26-ae): 2–6 team competitive Bomberman on a procedurally generated 16×16 maze.

Key challenges:

- **Partial observability** (directional viewcones, not the full map)
- **Sparse terminal rewards** (±50 for base destruction/survival)
- **Procedural generation** (a new maze every episode)
- **Risk of camping** near the base without an exploration signal

## 🏗️ Architecture

### Three-Phase Training Pipeline

| Phase | Description | Opponents | Key Technique |
|---|---|---|---|
| **1** | MaskablePPO baseline | Random valid actions | Invalid action masking |
| **2** | Adaptive exploration | Random + visit-count bonus | Annealing: `α = 1 − tanh(k·deaths)` |
| **3** | Curriculum self-play | Rule-based (static → smart) | Elo-style difficulty progression |

### Design Decisions (Literature-Backed)

1. **MaskablePPO** (`sb3-contrib`): Handles invalid actions by setting their logits to `-∞` before the softmax, so they receive zero probability (a minimal sketch follows this list). Proven superior to action penalties (Huang & Ontañón, 2020).
2. **MAPPO-style hyperparameters**: Value normalization, centralized value / decentralized policy, low sample reuse (Yu et al., NeurIPS 2022).
3. **Adaptive exploration annealing**: Follows the 2024 Pommerman state-of-the-art recipe (arxiv:2407.00662). As the agent's skill improves, the exploration bonus decreases automatically, preventing camping (sketched below).
4. **Curriculum learning**: 4 stages — static → simple → smart → mixed opponents. Advance at a 55% win rate, or after 500 episodes in a stage, whichever comes first (sketched below).
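The masking trick from design decision 1 is small enough to show standalone. Below is a minimal numpy illustration of a masked softmax; it is not the `sb3-contrib` internals (which operate on torch logits inside the policy), just the core idea:

```python
import numpy as np

def masked_softmax(logits: np.ndarray, valid: np.ndarray) -> np.ndarray:
    """Send invalid-action logits to -inf, then softmax:
    masked actions get exactly zero probability."""
    masked = np.where(valid, logits, -np.inf)
    z = masked - masked.max()   # max over valid entries keeps exp() stable
    exp = np.exp(z)
    return exp / exp.sum()

logits = np.array([1.2, 0.3, -0.5, 2.0])
valid = np.array([True, False, True, True])  # e.g., "place bomb" unavailable
probs = masked_softmax(logits, valid)
assert probs[1] == 0.0  # the masked action can never be sampled
```

Sampling can then never pick a masked action, and Huang & Ontañón (arxiv:2006.14171) show this still yields a valid policy gradient, unlike penalizing invalid actions after the fact.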
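Design decision 3's schedule is given here only as `α = 1 − tanh(k·deaths)`; the bonus shape, the constant `k`, and the use of cumulative deaths as the experience proxy are not pinned down in this README. A sketch under those assumptions, using a standard count-based bonus of the 1/√N form (`k = 0.05` and `bonus_scale = 0.1` are illustrative values, not the repo's actual settings):

```python
import math

def explore_weight(total_deaths: int, k: float = 0.05) -> float:
    """alpha = 1 - tanh(k * deaths): near 1 early in training, decaying
    toward 0 as deaths (a rough proxy for accumulated experience) pile up."""
    return 1.0 - math.tanh(k * total_deaths)

def shaped_reward(env_reward: float, tile_visits: int, total_deaths: int,
                  bonus_scale: float = 0.1) -> float:
    """Env reward plus an annealed count-based exploration bonus.
    Rarely visited tiles pay more, and the whole bonus fades as alpha -> 0,
    so a converged agent is driven by the true (sparse) reward alone."""
    bonus = bonus_scale / math.sqrt(tile_visits + 1)
    return env_reward + explore_weight(total_deaths) * bonus
```

This is what counters camping: sitting near the base keeps revisiting the same tiles, whose bonus shrinks with every visit, while stepping into unvisited cells still pays.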
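For design decision 4, the README fixes only the advancement rule (55% win rate, or 500 episodes per stage). A sketch of a tracker implementing that rule; the 100-episode rolling window is an assumption, not a documented value:

```python
from collections import deque

STAGES = ["static", "simple", "smart", "mixed"]
ADVANCE_WIN_RATE = 0.55    # move up once the rolling win rate clears 55%
MAX_STAGE_EPISODES = 500   # ...or after 500 episodes in a stage

class CurriculumTracker:
    """Decides when to swap in harder rule-based opponents."""

    def __init__(self, window: int = 100):
        self.stage = 0
        self.episodes_in_stage = 0
        self.recent = deque(maxlen=window)  # 1 = win, 0 = loss/draw

    def record_episode(self, won: bool) -> None:
        self.recent.append(int(won))
        self.episodes_in_stage += 1
        window_full = len(self.recent) == self.recent.maxlen
        win_rate = sum(self.recent) / len(self.recent)
        if ((window_full and win_rate >= ADVANCE_WIN_RATE)
                or self.episodes_in_stage >= MAX_STAGE_EPISODES):
            self._advance()

    def _advance(self) -> None:
        if self.stage < len(STAGES) - 1:  # "mixed" is the terminal stage
            self.stage += 1
            self.episodes_in_stage = 0
            self.recent.clear()

    @property
    def opponent_kind(self) -> str:
        return STAGES[self.stage]
```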
### Key Papers

- **Pommerman multi-agent RL**: arxiv:2407.00662 — 98.85% win rate recipe
- **MAPPO best practices**: arxiv:2103.01955 — Yu et al., NeurIPS 2022
- **Invalid action masking**: arxiv:2006.14171 — theoretically justified
- **RND exploration** (fallback): arxiv:1810.12894 — if the Phase 2 agent still camps

## 🚀 Running Training

### Prerequisites

```bash
# Download the environment (also auto-bootstrapped by the training script)
python -c "from huggingface_hub import snapshot_download; snapshot_download('e-rong/til-26-ae', repo_type='space', local_dir='./til-26-ae-repo')"
```

### Local Training

```bash
# TOTAL_TIMESTEPS is a colon-separated per-phase budget (Phase 1:Phase 2:Phase 3)
export TOTAL_TIMESTEPS="500_000:500_000:1_000_000"
export HUB_MODEL_ID="E-Rong/til-26-ae-agent"
export TRACKIO_PROJECT="til-26-ae"
python train_all_phases.py
```

### HF Jobs (Recommended)

```bash
# Requires HF credits — run from a Space with the script uploaded
# Hardware: cpu-upgrade, or a10g-large for GPU acceleration
```

## 📊 Monitoring

Trackio dashboard: `E-Rong/til-26-ae-trackio`

Logged metrics per phase:

- `train/mean_episode_reward`
- `train/mean_episode_length`
- `train/mean_explore_bonus` (Phase 2)
- `train/curriculum_stage` (Phase 3)

Alerts trigger on:

- Low reward (< -5) after 50k steps → suggests camping
- Curriculum stage advancement

## 📁 Repository Structure

```
train_all_phases.py          # Full 3-phase pipeline
requirements.txt             # Dependencies
bomberman_phase1_final.zip   # Saved after Phase 1
bomberman_phase2_final.zip   # Saved after Phase 2
bomberman_phase3_final.zip   # Saved after Phase 3
```

## 🧪 Evaluation

To evaluate a trained agent against random opponents:

```python
from train_all_phases import BombermanSingleAgentEnv
from sb3_contrib import MaskablePPO
from til_environment.config import default_config

cfg = default_config()
env = BombermanSingleAgentEnv(cfg=cfg)
model = MaskablePPO.load("bomberman_phase3_final")

obs, _ = env.reset(seed=42)
for _ in range(200):
    # Ask the env for the current valid-action mask so the policy
    # never proposes an illegal move
    action, _ = model.predict(obs, action_masks=env.action_masks())
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break
env.close()
```

## 📜 License

MIT — based on the TIL-26 AE challenge environment.

## Generated by ML Intern

This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern

## Usage

The checkpoints in this repository are Stable-Baselines3 zip archives (see Repository Structure above), not `transformers` weights, so load them with `MaskablePPO` rather than an `AutoModel` class:

```python
from huggingface_hub import hf_hub_download
from sb3_contrib import MaskablePPO

model_path = hf_hub_download("E-Rong/til-26-ae-agent", "bomberman_phase3_final.zip")
model = MaskablePPO.load(model_path)
```

See the Evaluation section above for running the loaded policy against the environment.