# TIL-26-AE Bomberman Agent — MaskablePPO + Curriculum Learning

This repository contains the training pipeline for an RL agent competing in the
**TIL-26 Automated Exploration** (AE) challenge — a competitive multi-agent
Bomberman-like environment.

## 🎯 Challenge

[Environment](https://huggingface.co/spaces/e-rong/til-26-ae): 2–6 team competitive Bomberman on a procedurally generated 16×16 maze. Key challenges:
- **Partial observability** (directional viewcones, not the full map)
- **Sparse terminal rewards** (±50 for destroying a base / surviving)
- **Procedural generation** (a new maze every episode)
- **Risk of camping** (near the base, absent an exploration signal)

## 🏗️ Architecture

### Three-Phase Training Pipeline

| Phase | Description | Opponents | Key Technique |
|---|---|---|---|
| **1** | MaskablePPO baseline | Random valid actions | Invalid action masking |
| **2** | Adaptive exploration | Random + visit-count bonus | Annealing: `α = 1 − tanh(k·deaths)` |
| **3** | Curriculum self-play | Rule-based (static → smart) | Elo-style difficulty progression |

### Design Decisions (Literature-Backed)

1. **MaskablePPO** (`sb3-contrib`): handles invalid actions by setting their logits to `-∞` before the softmax; proven superior to action penalties (Huang & Ontañón, 2020).
2. **MAPPO-style hyperparameters**: value normalization, centralized value function / decentralized policy, low sample reuse (Yu et al., NeurIPS 2022).
3. **Adaptive exploration annealing**: taken directly from the 2024 Pommerman SOTA. As agent skill improves, the exploration bonus decreases automatically, preventing camping (see the sketch after this list).
4. **Curriculum learning**: 4 stages — static → simple → smart → mixed opponents. Advance at a 55% win rate, or after 500 episodes at most (gate sketched below).
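
The two numeric recipes above are small enough to state in code. A minimal sketch: `K`, the `0.1` bonus scale, and the 100-episode window are illustrative choices, not the exact values in `train_all_phases.py`.

```python
import math
from collections import Counter, deque

# --- Phase 2: adaptive exploration bonus, alpha = 1 - tanh(k * deaths) ---
# K and the 0.1 bonus scale are illustrative, not the repo's exact values.
K = 0.01
visit_counts = Counter()   # (x, y) -> visits this episode
total_deaths = 0           # cumulative across training

def explore_bonus(cell):
    """Visit-count bonus, annealed away as deaths accumulate."""
    visit_counts[cell] += 1
    alpha = 1.0 - math.tanh(K * total_deaths)
    return alpha * 0.1 / math.sqrt(visit_counts[cell])

# --- Phase 3: curriculum gate, advance at 55% win rate or 500 episodes ---
recent_results = deque(maxlen=100)  # 1 = win, 0 = loss; window size illustrative
episodes_in_stage = 0

def should_advance():
    win_rate = sum(recent_results) / len(recent_results) if recent_results else 0.0
    return win_rate >= 0.55 or episodes_in_stage >= 500
```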

### Key Papers

- **Pommerman multi-agent RL**: arXiv:2407.00662 — 98.85% win-rate recipe
- **MAPPO best practices**: arXiv:2103.01955 — NeurIPS 2022
- **Invalid action masking**: arXiv:2006.14171 — theoretically justified
- **RND exploration** (fallback): arXiv:1810.12894 — in case Phase 2 still camps

## 🚀 Running Training

### Prerequisites
```bash
# Download the environment (auto-bootstrapped in the training script)
python -c "from huggingface_hub import snapshot_download; snapshot_download('e-rong/til-26-ae', repo_type='space', local_dir='./til-26-ae-repo')"
```

### Local Training
```bash
# Colon-separated step budgets for Phases 1, 2, 3
export TOTAL_TIMESTEPS="500_000:500_000:1_000_000"
export HUB_MODEL_ID="E-Rong/til-26-ae-agent"
export TRACKIO_PROJECT="til-26-ae"
python train_all_phases.py
```
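
For orientation, one phase of this pipeline reduces to a standard MaskablePPO training loop. A minimal sketch, not the exact contents of `train_all_phases.py`: it assumes the `BombermanSingleAgentEnv` and `action_masks()` names from the evaluation example below, and uses sb3-contrib's `ActionMasker` wrapper to expose masks during rollouts.

```python
import os

from sb3_contrib import MaskablePPO
from sb3_contrib.common.wrappers import ActionMasker
from til_environment.config import default_config
from train_all_phases import BombermanSingleAgentEnv

# int() accepts underscores, so "500_000" parses directly (PEP 515)
budgets = [int(s) for s in os.environ["TOTAL_TIMESTEPS"].split(":")]

env = BombermanSingleAgentEnv(cfg=default_config())
# Expose the env's mask to MaskablePPO during rollout collection
env = ActionMasker(env, lambda e: e.action_masks())

model = MaskablePPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=budgets[0])   # Phase 1 budget
model.save("bomberman_phase1_final")      # saved as bomberman_phase1_final.zip
```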

### HF Jobs (Recommended)
```bash
# Requires HF credits — run from a Space with the script uploaded.
# Hardware: cpu-upgrade, or a10g-large for GPU acceleration.
# Illustrative invocation (assumed flags; check `hf jobs uv run --help`):
hf jobs uv run --flavor a10g-large train_all_phases.py
```

## 📊 Monitoring

Trackio dashboard: `E-Rong/til-26-ae-trackio`

Logged metrics per phase (logging sketched below):
- `train/mean_episode_reward`
- `train/mean_episode_length`
- `train/mean_explore_bonus` (Phase 2)
- `train/curriculum_stage` (Phase 3)

Alerts trigger on:
- Low reward (< -5) after 50k steps → suggests camping
- Curriculum stage advancement
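
A minimal sketch of the logging side, assuming Trackio's wandb-style `init`/`log`/`finish` API; the metric values are placeholders, and the camping alert is written out as a plain check.

```python
import os

import trackio

# TRACKIO_PROJECT is set in the Local Training snippet above
trackio.init(project=os.environ.get("TRACKIO_PROJECT", "til-26-ae"))

mean_reward = -7.2    # placeholder; computed from recent episodes in practice
global_step = 60_000  # placeholder

trackio.log({
    "train/mean_episode_reward": mean_reward,
    "train/mean_episode_length": 140,   # placeholder
    "train/mean_explore_bonus": 0.05,   # Phase 2 only
    "train/curriculum_stage": 2,        # Phase 3 only
})

# The camping alert above, as a plain check:
if global_step > 50_000 and mean_reward < -5:
    print("ALERT: low reward after 50k steps; agent may be camping")

trackio.finish()
```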

## 📁 Repository Structure

```
train_all_phases.py          # Full 3-phase pipeline
requirements.txt             # Dependencies
bomberman_phase1_final.zip   # Saved after Phase 1
bomberman_phase2_final.zip   # Saved after Phase 2
bomberman_phase3_final.zip   # Saved after Phase 3
```

## 🧪 Evaluation

To evaluate a trained agent against random opponents:
```python
from train_all_phases import BombermanSingleAgentEnv
from sb3_contrib import MaskablePPO
from til_environment.config import default_config

cfg = default_config()
env = BombermanSingleAgentEnv(cfg=cfg)
model = MaskablePPO.load("bomberman_phase3_final")

obs, _ = env.reset(seed=42)
for _ in range(200):
    # Mask out invalid actions at inference time as well
    action, _ = model.predict(obs, action_masks=env.action_masks())
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break
env.close()
```
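
To turn this into a rough win rate, loop over whole episodes. A minimal sketch, assuming the ±50 terminal-reward convention described above, so a positive final reward stands in for a win:

```python
from train_all_phases import BombermanSingleAgentEnv
from sb3_contrib import MaskablePPO
from til_environment.config import default_config

env = BombermanSingleAgentEnv(cfg=default_config())
model = MaskablePPO.load("bomberman_phase3_final")

wins, n_episodes = 0, 20
for ep in range(n_episodes):
    obs, _ = env.reset(seed=ep)
    terminated = truncated = False
    reward = 0.0
    while not (terminated or truncated):
        action, _ = model.predict(obs, action_masks=env.action_masks())
        obs, reward, terminated, truncated, _ = env.step(action)
    wins += reward > 0   # terminal reward sign as the win signal
env.close()
print(f"Win-rate proxy: {wins / n_episodes:.0%}")
```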

## 📜 License

MIT — based on the TIL-26 AE challenge environment.