---
tags:
- ml-intern
---

# TIL-26-AE Bomberman Agent: MaskablePPO + Curriculum Learning

This repository contains the training pipeline for an RL agent competing in the
**TIL-26 Automated Exploration** (AE) challenge, a competitive multi-agent
Bomberman-like environment.

## 🎯 Challenge

[Environment](https://huggingface.co/spaces/e-rong/til-26-ae): 2–6 team competitive Bomberman on a procedurally generated 16×16 maze. Key challenges:
- **Partial observability** (directional viewcones, not the full map)
- **Sparse terminal rewards** (±50 for base destruction/survival)
- **Procedural generation** (new maze every episode)
- **Risk of camping** near the base without an exploration signal

## 🏗️ Architecture

### Three-Phase Training Pipeline

| Phase | Description | Opponents | Key Technique |
|---|---|---|---|
| **1** | MaskablePPO baseline | Random valid actions | Invalid action masking |
| **2** | Adaptive exploration | Random + visit-count bonus | Annealing: `α = 1 - tanh(k·deaths)` (sketched below) |
| **3** | Curriculum self-play | Rule-based (static → smart) | Elo-style difficulty progression |
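
A minimal sketch of the Phase 2 annealing, assuming `deaths` is the agent's cumulative death count across training and a hypothetical `1/√N` visit-count bonus (the actual constants live in `train_all_phases.py`):

```python
import numpy as np

def exploration_alpha(cumulative_deaths: int, k: float = 0.05) -> float:
    """Annealing coefficient from the table: alpha = 1 - tanh(k * deaths).

    As deaths accumulate over training, alpha decays from 1 toward 0, so
    the exploration bonus fades out. k = 0.05 is an illustrative value.
    """
    return 1.0 - np.tanh(k * cumulative_deaths)

def shaped_reward(env_reward: float, visit_count: int, cumulative_deaths: int) -> float:
    """Sparse environment reward plus an annealed count-based bonus."""
    bonus = 1.0 / np.sqrt(visit_count + 1)  # hypothetical 1/sqrt(N) bonus
    return env_reward + exploration_alpha(cumulative_deaths) * bonus
```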

### Design Decisions (Literature-Backed)

1. **MaskablePPO** (`sb3-contrib`): handles invalid actions by setting their logits to `-∞` before the softmax (illustrated after this list). Shown to outperform action-penalty approaches (Huang & Ontañón, 2020).
2. **MAPPO-style hyperparameters**: value normalization, centralized value function with decentralized policies, and low sample reuse (Yu et al., NeurIPS 2022).
3. **Adaptive exploration annealing**: taken directly from the 2024 Pommerman state of the art (cited below). As agent skill improves, the exploration bonus decays automatically, preventing camping.
4. **Curriculum learning**: four stages, static → simple → smart → mixed opponents. Advance at a 55% win rate, or after 500 episodes at most.
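
Decision 1 in plain NumPy, as a sketch of the masking math rather than `sb3-contrib`'s actual implementation:

```python
import numpy as np

def masked_policy(logits: np.ndarray, valid: np.ndarray) -> np.ndarray:
    """Set logits of invalid actions to -inf before the softmax.

    `valid` is a boolean mask of legal actions. Invalid actions receive
    exactly zero probability, so they are never sampled and contribute
    no gradient.
    """
    masked = np.where(valid, logits, -np.inf)
    z = np.exp(masked - masked.max())  # subtract max for numerical stability
    return z / z.sum()

# Example: action 2 (say, "place bomb") is illegal in the current state
print(masked_policy(np.array([1.0, 0.5, 2.0]), np.array([True, True, False])))
# -> [0.622 0.378 0.   ]
```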

### Key Papers

- **Pommerman multi-agent RL**: arXiv:2407.00662, the 98.85% win-rate recipe
- **MAPPO best practices**: arXiv:2103.01955 (NeurIPS 2022)
- **Invalid action masking**: arXiv:2006.14171, the theoretical justification
- **RND exploration** (fallback): arXiv:1810.12894, in case Phase 2 still camps

## 🚀 Running Training

### Prerequisites
```bash
# Download the environment (auto-bootstrapped in the script)
python -c "from huggingface_hub import snapshot_download; snapshot_download('e-rong/til-26-ae', repo_type='space', local_dir='./til-26-ae-repo')"
```

### Local Training
```bash
export TOTAL_TIMESTEPS="500_000:500_000:1_000_000"
export HUB_MODEL_ID="E-Rong/til-26-ae-agent"
export TRACKIO_PROJECT="til-26-ae"
python train_all_phases.py
```
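
`TOTAL_TIMESTEPS` appears to carry one colon-separated timestep budget per phase; a sketch of parsing it under that assumption (the actual handling in `train_all_phases.py` may differ):

```python
import os

# Three colon-separated budgets, one per training phase (assumed convention).
# int() accepts underscore digit separators directly (Python 3.6+).
raw = os.environ.get("TOTAL_TIMESTEPS", "500_000:500_000:1_000_000")
phase1_steps, phase2_steps, phase3_steps = (int(p) for p in raw.split(":"))
```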

### HF Jobs (Recommended)
```bash
# Requires HF credits; run from a Space with the script uploaded
# Hardware: cpu-upgrade, or a10g-large for GPU acceleration
```

## 📊 Monitoring

Trackio dashboard: `E-Rong/til-26-ae-trackio`

Logged metrics per phase:
- `train/mean_episode_reward`
- `train/mean_episode_length`
- `train/mean_explore_bonus` (Phase 2)
- `train/curriculum_stage` (Phase 3)
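
A sketch of logging these metrics with Trackio's wandb-style API (the metric values here are illustrative):

```python
import trackio

trackio.init(project="til-26-ae")  # project name from the dashboard above
trackio.log({
    "train/mean_episode_reward": -3.2,  # illustrative values only
    "train/mean_episode_length": 141.0,
    "train/mean_explore_bonus": 0.42,   # Phase 2
})
trackio.finish()
```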

Alerts trigger on:
- Low reward (< -5) after 50k steps, which suggests camping
- Curriculum stage advancement
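
The camping alert reduces to a simple predicate; a sketch (the function name is hypothetical, the thresholds are the ones stated above):

```python
def camping_suspected(mean_episode_reward: float, total_steps: int) -> bool:
    """Flag likely camping: persistently low reward after the 50k-step warm-up."""
    return total_steps > 50_000 and mean_episode_reward < -5
```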

## 📁 Repository Structure

```
train_all_phases.py          # Full 3-phase pipeline
requirements.txt             # Dependencies
bomberman_phase1_final.zip   # Saved after Phase 1
bomberman_phase2_final.zip   # Saved after Phase 2
bomberman_phase3_final.zip   # Saved after Phase 3
```

## 🧪 Evaluation

To evaluate a trained agent against random opponents:
```python
from train_all_phases import BombermanSingleAgentEnv
from sb3_contrib import MaskablePPO
from til_environment.config import default_config

cfg = default_config()
env = BombermanSingleAgentEnv(cfg=cfg)
model = MaskablePPO.load("bomberman_phase3_final")

obs, _ = env.reset(seed=42)
for _ in range(200):
    # Re-query the action mask each step so illegal moves are never sampled
    action, _ = model.predict(obs, action_masks=env.action_masks(), deterministic=True)
    obs, reward, done, truncated, info = env.step(action)
    if done or truncated:
        break
env.close()
```
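
To turn the single rollout into a rough win-rate estimate, loop over seeds; treating a return of +50 or more as a win is an assumption based on the ±50 terminal rewards described in the Challenge section:

```python
env = BombermanSingleAgentEnv(cfg=cfg)  # fresh env; the one above was closed
wins, n_episodes = 0, 20
for seed in range(n_episodes):
    obs, _ = env.reset(seed=seed)
    episode_return, done, truncated = 0.0, False, False
    while not (done or truncated):
        action, _ = model.predict(obs, action_masks=env.action_masks(), deterministic=True)
        obs, reward, done, truncated, info = env.step(action)
        episode_return += reward
    wins += episode_return >= 50  # assumption: +50 terminal reward marks a win
print(f"Win rate over {n_episodes} episodes: {wins / n_episodes:.0%}")
env.close()
```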

## 📄 License

MIT; based on the TIL-26 AE challenge environment.

<!-- ml-intern-provenance -->
## Generated by ML Intern

This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern

## Usage

The checkpoints in this repository are Stable-Baselines3 zip archives (see Repository Structure), not `transformers` weights, so load them with `sb3-contrib` rather than the `AutoModel` classes:

```python
from huggingface_hub import hf_hub_download
from sb3_contrib import MaskablePPO

model_id = "E-Rong/til-26-ae-agent"
checkpoint = hf_hub_download(model_id, "bomberman_phase3_final.zip")
model = MaskablePPO.load(checkpoint)
```