E-Rong committed (verified) · Commit a00144d · 1 Parent(s): 5c6cad0

Update docs: Phase 2 started, add interim results

Files changed (1):
  1. docs/ae.md +64 -260

docs/ae.md CHANGED
@@ -23,43 +23,28 @@

 ### 1.1 Domain: Multi-Agent Bomberman RL

- The TIL-26-AE challenge is a multi-agent Bomberman-like environment where agents navigate a grid, collect resources, place bombs, destroy walls, and eliminate opponents. The key challenge is **autonomous exploration** — agents must learn to navigate, compete, and survive without hand-crafted heuristics.

- ### 1.2 Key Papers Consulted

 | Paper | arXiv ID | Key Insight | Relevance |
 |---|---|---|---|
- | *Pommerman: A Multi-Agent Benchmark* | 2407.00662 | Multi-agent competitive environment similar to Bomberman; MAPPO baseline performance | Confirmed PettingZoo + parallel env as standard approach |
- | *The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games* | 2103.01955 | MAPPO with shared parameters, role-specific conditioning | Justified single-agent wrapper with self-play curriculum |
- | *A Closer Look at Invalid Action Masking in Policy Gradient Algorithms* | 2006.14171 | Invalid action masking dramatically improves sample efficiency in discrete action spaces with legal-action constraints | **Directly applicable** — Bomberman has wall/edge constraints |
- | *Proximal Policy Optimization Algorithms* | 1707.06347 | PPO with clipped surrogate objective, stable and scalable | Chosen over DQN for continuous policy updates and easier masking integration |

 ### 1.3 Why MaskablePPO?

- After reading `arxiv:2006.14171`, we identified that **invalid action masking** is critical for this domain:
-
- - Bomberman agents cannot move into walls, out of bounds, or place bombs without a stockpile
- - The observation includes `action_mask: uint8[6]` — a binary legal-action indicator
- - Standard PPO would waste ~30-40% of samples on illegal moves early in training
- - `sb3-contrib`'s `MaskablePPO` masks logits before softmax, ensuring only legal actions are sampled
-
- **Decision**: Use `sb3-contrib`'s `MaskablePPO` with the `ActionMasker` wrapper.

 ### 1.4 Why Curriculum Learning?

- From `arxiv:2103.01955` (MAPPO) and Pommerman benchmarks, we learned:
-
- - Training against strong opponents from scratch leads to **catastrophic early losses** (~0 reward)
- - Curriculum learning (easy → hard) is standard practice in competitive multi-agent RL
- - Rule-based opponents at increasing difficulty provide stable reward signals during learning
-
- **Decision**: Implement a 3-phase curriculum with adaptive difficulty gating.

- ### 1.5 Why Not DQN / Rainbow?
-
- - DQN struggles with action masking (it requires a custom architecture)
- - PPO's on-policy updates handle the non-stationarity of multi-agent self-play better
- - PPO is simpler to tune and has mature invalid-action-masking support in `sb3-contrib`

 ---
 
@@ -67,176 +52,66 @@ From `arxiv:2103.01955` (MAPPO) and Pommerman benchmarks, we learned:

 ### 2.1 Environment Structure

- The `til-26-ae` environment (`e-rong/til-26-ae` Space) is a PettingZoo-style AEC (Agent Environment Cycle) multi-agent game:
-
- - **Grid size**: 16×16 (confirmed from `default_config()`)
 - **Agents**: Configurable (default 2 teams, Phase 3 uses 3)
- - **Observations**: Dict with:
-   - `agent_viewcone`: float32 [7×5×25] — agent-facing view
-   - `base_viewcone`: float32 [5×5×25] — base-centered view
-   - `direction`: Discrete(4) — facing
-   - `location`, `base_location`, `health`, `frozen_ticks`, `base_health`, `team_resources`, `team_bombs`, `step`
-   - `action_mask`: uint8[6] — binary legality mask
- - **Actions**: Discrete(6)
-   - 0 = FORWARD, 1 = BACKWARD, 2 = LEFT, 3 = RIGHT, 4 = STAY, 5 = PLACE_BOMB
- - **Episode length**: ~200 steps (observed during training)

 ### 2.2 Observation Flattening

- We flatten the dict observation into a **1511-dim vector**:
-
- ```
- agent_viewcone: 7 × 5 × 25 = 875
- base_viewcone:  5 × 5 × 25 = 625
- direction:       1
- location:        2
- base_location:   2
- health:          1
- frozen_ticks:    1
- base_health:     1
- team_resources:  1
- team_bombs:      1
- step:            1
- ─────────────────────────────
- TOTAL:        1511
- ```
-
- This matches the MLP policy input in `MaskablePPO("MlpPolicy", ...)`.
-
- ### 2.3 Action Masking Implementation
-
- ```python
- env = ActionMasker(base_env, lambda e: e.action_masks())
- ```
-
- The wrapper exposes `action_masks()`, which returns a bool[6] array. `MaskablePPO` uses this internally via `sb3_contrib`'s `get_action_masks()` during rollout collection.
-
- **Critical bug found**: `Monitor` must wrap *outside* `ActionMasker`, not inside. Otherwise `get_action_masks()` fails because `Monitor` does not expose `action_masks()`. We fixed this ordering issue during development.

 ---

 ## 3. Development Decisions

- ### 3.1 Why a Single-Agent Wrapper?
-
- The TIL environment is inherently multi-agent (PettingZoo AEC). However, for the AE challenge, we only control **agent_0**; opponents use fixed policies. We wrapped the parallel PettingZoo env into a `gymnasium.Env` that:
-
- - Runs the full multi-agent step
- - Returns only agent_0's observation/reward/done
- - Uses random valid actions for opponents (Phase 1-2) or rule-based policies (Phase 3)
-
- This reduces the problem to single-agent RL in a non-stationary environment (opponent policies change between phases).
-
- ### 3.2 Why a 3-Phase Curriculum?

 | Phase | Opponent | Duration | Purpose |
 |---|---|---|---|
- | **1** | Random valid actions | 500k steps | Learn basic movement, bomb mechanics, map navigation |
- | **2** | Random + exploration shaping | 500k steps | Prevent "camping" exploit; encourage full map coverage |
- | **3** | Rule-based (curriculum) | 1M steps | Generalize to structured opponents; scale to multi-team |
-
- **Phase 1 vs Random** gives the agent a chance to learn fundamentals without being immediately killed by competent opponents. Random opponents still place bombs and move, providing exposure to explosion mechanics.
-
- **Phase 2 Exploration Shaping** addresses a known issue: agents learn to survive by staying near their base and waiting for random opponents to walk into bombs. The visit-count bonus (`1/(1+visits)`) forces the agent to explore new tiles.
-
- **Phase 3 Curriculum** transitions from random to structured opponents using a difficulty ladder: static → simple → smart → mixed. This mirrors how humans learn and prevents the "forgetting" problem when suddenly switching opponent types.
-
- ### 3.3 Why Stable-Baselines3 + sb3-contrib?
-
- | Library | Role |
- |---|---|
- | `stable-baselines3` | Core PPO implementation, callbacks, Monitor, checkpoints |
- | `sb3-contrib` | `MaskablePPO`, `ActionMasker`, invalid-action masking utilities |
- | `gymnasium` | Env API (observation/action spaces, step/reset) |
- | `pettingzoo` | Multi-agent env conversion (`aec_to_parallel`) |
- | `huggingface_hub` | Push checkpoints to persistent storage |
-
- ### 3.4 Why Push Checkpoints to the Hub Every 50k Steps?
-
- During development, we encountered **sandbox resets** (the T4 container was recycled unexpectedly). Local `/app/data/` was lost, but the Hub model repo (`E-Rong/til-26-ae-agent`) persisted.
-
- **Decision**: Implement a dual-save strategy:
- - Local: `CheckpointCallback(save_freq=50000)`
- - Hub: Custom callback calling `HfApi.upload_file()` every 50k steps
-
- This saved the project when the sandbox reset at 400k steps — we resumed from `ckpt_400000.zip` on the Hub without losing progress.
-
- ### 3.5 Why A10G over T4?
-
- | Hardware | FPS | Time for 100k steps |
- |---|---|---|
- | T4 | ~42 | ~40 min |
- | A10G | ~52 | ~32 min |
-
- A10G provided more stable performance and 24GB of VRAM (vs the T4's 16GB). Given the ~2M total steps across 3 phases, A10G saves ~2 hours total.
-
- ---
-
- ## 4. Training Phases
-
- ### 4.1 Phase 1: Foundation (MaskablePPO vs Random)
-
- **Duration**: 500,000 steps
- **Opponent**: Random valid actions
- **Environment**: 2 teams, agent_0 vs random
- **Hyperparameters**:
- ```python
- MaskablePPO(
-     "MlpPolicy", env,
-     learning_rate=3e-4,
-     n_steps=2048,
-     batch_size=64,
-     n_epochs=10,
-     gamma=0.99,
-     gae_lambda=0.95,
-     clip_range=0.2,
-     ent_coef=0.01,  # Encourage exploration early
-     vf_coef=0.5,
-     max_grad_norm=0.5,
-     device="cuda",
- )
- ```
-
- **Purpose**: Learn basic survival, map layout, bomb mechanics, and opponent interaction. Random opponents provide a low-stakes environment where the agent can experiment without being immediately eliminated.
-
- **Challenges encountered**:
- - Initial wrapper ordering bug (`Monitor` inside `ActionMasker`)
- - Missing dependencies (`omegaconf`, `perlin_noise`) in fresh sandboxes
- - Sandbox resets — resolved by Hub checkpointing
-
- ### 4.2 Phase 2: Exploration Shaping (Adaptive Annealing)
-
- **Duration**: 500,000 steps
- **Opponent**: Random valid actions
- **Environment**: 2 teams + exploration bonus
- **Mechanism**:
- ```python
- # Visit-count bonus
- visit_bonus = 1.0 / (1.0 + visit_counts[x, y])
-
- # Adaptive annealing
- alpha = 1.0 - tanh(k * avg_enemy_deaths)
- explore_weight = base_weight * max(0.1, alpha)
- ```
-
- As the agent gets better at killing enemies, the exploration bonus fades, shifting focus toward combat optimization.
-
- **Purpose**: Prevent the "camping" exploit where agents hide near their base and wait. Force proactive map exploration and resource collection.
-
- ### 4.3 Phase 3: Curriculum Self-Play (Rule-Based Opponents)
-
- **Duration**: 1,000,000 steps
- **Opponent**: Rule-based with curriculum difficulty
- **Environment**: 3 teams
- **Curriculum stages**:
- 1. **Static**: Opponents do nothing (STAY)
- 2. **Simple**: Bomb when an enemy is in the viewcone, otherwise move randomly
- 3. **Smart**: Score-based action selection (collectibles + wall avoidance)
- 4. **Mixed**: Half smart, half simple
-
- **Advancement condition**: Win rate ≥ 55% over 500 episodes, or max 500 episodes per stage.
-
- **Purpose**: Generalize to structured, multi-team competition. The curriculum ensures the agent doesn't face a difficulty cliff when switching from random to competent opponents.

 ---
 
@@ -246,113 +121,42 @@ As the agent gets better at killing enemies, the exploration bonus fades, shifting focus toward combat optimization.

 | Metric | Value |
 |---|---|
- | **Timesteps** | 500,352 |
- | **Final Training Reward** | 237.0 |
- | **FPS** | 52 (A10G) |
- | **Total wall time** | ~2h 15min |
- | **Checkpoints** | ckpt_50000 through ckpt_400000 (every 50k) + phase1_final |

- ### 5.2 Phase 1 Evaluation (100 Episodes vs Random Opponents)

 | Metric | Value |
 |---|---|
- | **Win Rate** | **92.0%** (92/100) |
- | **Average Reward** | **180.1** |
- | **Average Episode Length** | 200.0 steps |
- | **Average Bombs/Episode** | 20.4 |
- | **Survival Rate** | **100.0%** |
-
- **Interpretation**: The agent has mastered the basics against random opponents. It consistently survives full episodes, places bombs frequently, and wins nearly every match. The gap between the training reward (237) and the eval reward (180) suggests reward shaping (e.g., an exploration bonus) during training that doesn't transfer to deterministic eval.
-
- ### 5.3 Training Reward Trajectory (Phase 1)
-
- | Steps | Episode Reward | Notes |
- |---|---|---|
- | 2,048 | 41.4 | Initial random policy |
- | 20,480 | ~104 | Learning movement |
- | 53,248 | ~116 | First checkpoint |
- | 110,592 | ~159 | Consistent improvement |
- | 204,096 | ~219 | Strong policy emerging |
- | 306,496 | ~203 | Slight dip (exploration) |
- | 416,384 | ~224 | Convergence |
- | 500,352 | 237.0 | Final |

 ---

 ## 6. Artifacts

- ### 6.1 Model Repo (`E-Rong/til-26-ae-agent`)
-
- | File | Purpose |
- |---|---|
- | `phase1_final.zip` | Phase 1 trained model (500k steps) |
- | `ckpt_50000.zip` – `ckpt_400000.zip` | Intermediate checkpoints |
- | `ae_manager.py` | Inference code for the AE server |
- | `phase1_eval_results.txt` | Raw evaluation numbers |
- | `phase1_summary.txt` | This summary (abridged) |
- | `train_all_phases.py` | Full training script |
- | `train_in_space.py` | Space-compatible training script |
- | `requirements.txt` | Python dependencies |
-
- ### 6.2 Space Integration (`e-rong/til-26-ae`)
-
 | File | Purpose |
 |---|---|
- | `ae/src/ae_manager.py` | Loads `phase1_final.zip` from the Hub, serves actions via the `/ae` endpoint |
- | `ae/requirements.txt` | `sb3-contrib`, `torch`, `huggingface_hub` |
- | `ae/Dockerfile` | Standard Python 3.11 image, CPU-only for fast eval startup |
-
- ### 6.3 How Inference Works
-
- 1. The AE server receives `POST /ae` with an observation dict
- 2. `AEManager.ae(observation)` flattens the obs into a 1511-dim vector
- 3. Loads `MaskablePPO` from `phase1_final.zip` (cached at `/workspace/models/`)
- 4. Calls `model.predict(obs_vec, action_masks=mask, deterministic=True)`
- 5. Returns an action int in [0, 5]
-
- **Fallback**: If no model is found, return a random valid action.

 ---

 ## 7. Next Steps

- ### 7.1 Phase 2: Exploration Shaping (Pending)
-
- - Load `phase1_final.zip`
- - Add `RewardShapingWrapper` with an adaptive visit-count bonus
- - Train 500k steps vs random
- - Expected: Higher map coverage, less base-camping, similar win rate
-
- ### 7.2 Phase 3: Curriculum Self-Play (Pending)
-
- - Load the Phase 2 final model
- - Configure the 3-team environment
- - Progress through rule-based opponent difficulty
- - Expected: Win rate drops initially, then recovers as the curriculum advances
-
- ### 7.3 Evaluation Against Non-Random Opponents (Pending)
-
- - Evaluate the Phase 3 model vs rule-based "smart" opponents
- - Target: > 50% win rate against smart opponents
- - Multi-team evaluation (3-way matches)
-
- ### 7.4 Known Limitations
-
- - **No recurrent policy**: The MLP policy has no memory of past observations. It may struggle with bomb fuse timing or opponent tracking.
- - **No opponent modeling**: The policy treats opponent actions as environment noise. It could benefit from opponent ID or history encoding.
- - **Flattened observations**: Dict observations with spatial structure (viewcones) are flattened into vectors. A CNN policy might exploit spatial patterns better.
- - **Deterministic eval**: Currently uses `deterministic=True` for evaluation. Stochastic evaluation might reveal policy variance.
-
- ### 7.5 Future Improvements
-
- 1. **CNN policy**: Use `CnnPolicy` with 2D viewcone inputs instead of the MLP
- 2. **LSTM/GRU**: Add memory for temporal opponent tracking
- 3. **Proper self-play**: Train both teams simultaneously with shared or separated policy networks
- 4. **Population-based training**: Train a population of agents and evaluate them against each other
- 5. **Reward decomposition**: Separate rewards for movement, bombs, kills, survival, resource collection
-
- ---
-
- *Last updated: 2026-05-14*
- *Current phase: Phase 1 complete, Phase 2 pending*
- *Author: ML Intern (E-Rong)*
 
 ### 1.1 Domain: Multi-Agent Bomberman RL

+ The TIL-26-AE challenge is a multi-agent Bomberman-like environment where agents navigate a grid, collect resources, place bombs, destroy walls, and eliminate opponents. The key challenge is **autonomous exploration**.

+ ### 1.2 Key Papers

 | Paper | arXiv ID | Key Insight | Relevance |
 |---|---|---|---|
+ | *Pommerman: A Multi-Agent Benchmark* | 2407.00662 | PettingZoo + parallel env standard | Confirmed approach |
+ | *MAPPO* | 2103.01955 | Shared parameters, curriculum | Justified curriculum |
+ | *Invalid Action Masking* | 2006.14171 | Masks logits before softmax | **Directly applicable** |
+ | *PPO Algorithms* | 1707.06347 | Clipped surrogate, stable | Chosen over DQN |

 ### 1.3 Why MaskablePPO?

+ Bomberman agents cannot move into walls, go out of bounds, or place bombs without a stockpile. The observation includes `action_mask: uint8[6]`. Standard PPO would waste ~30-40% of samples on illegal moves; MaskablePPO masks logits before softmax, ensuring only legal actions are sampled.

 ### 1.4 Why Curriculum Learning?

+ Training against strong opponents from scratch leads to catastrophic early losses (~0 reward). Curriculum learning (easy → hard) is standard in competitive multi-agent RL.

+ ### 1.5 Why Not DQN?

+ DQN struggles with action masking (it requires a custom architecture). PPO's on-policy updates handle the non-stationarity of multi-agent self-play better, and `sb3-contrib` has mature masking support.

 ---

 ### 2.1 Environment Structure

+ - **Grid size**: 16×16
 - **Agents**: Configurable (default 2 teams, Phase 3 uses 3)
+ - **Observations**: Dict with `agent_viewcone[7×5×25]`, `base_viewcone[5×5×25]`, direction, location, health, `action_mask[6]`, etc.
+ - **Actions**: Discrete(6) — FORWARD, BACKWARD, LEFT, RIGHT, STAY, PLACE_BOMB
+ - **Episode length**: ~200 steps

 ### 2.2 Observation Flattening

+ Flattened to a **1511-dim vector**: agent_viewcone (875) + base_viewcone (625) + 11 scalars.
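The flattening above can be sketched as follows; the field shapes come from section 2.1, but the helper name `flatten_obs` and the zero-filled dummy observation are ours, not the project's code.

```python
import numpy as np

# Hypothetical sketch of the dict -> 1511-dim flattening. Field names and
# shapes follow the observation spec; `flatten_obs` is our name.
SCALAR_KEYS = ["direction", "location", "base_location", "health",
               "frozen_ticks", "base_health", "team_resources",
               "team_bombs", "step"]  # 1+2+2+1+1+1+1+1+1 = 11 values

def flatten_obs(obs: dict) -> np.ndarray:
    parts = [np.asarray(obs["agent_viewcone"], np.float32).ravel(),  # 875
             np.asarray(obs["base_viewcone"], np.float32).ravel()]   # 625
    parts += [np.asarray(obs[k], np.float32).ravel() for k in SCALAR_KEYS]
    return np.concatenate(parts)  # 875 + 625 + 11 = 1511

dummy = {"agent_viewcone": np.zeros((7, 5, 25)),
         "base_viewcone": np.zeros((5, 5, 25)),
         "direction": 0, "location": (0, 0), "base_location": (0, 0),
         "health": 1, "frozen_ticks": 0, "base_health": 1,
         "team_resources": 0, "team_bombs": 0, "step": 0}
assert flatten_obs(dummy).shape == (1511,)
```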

+ ### 2.3 Action Masking

+ Critical bug found: `Monitor` must wrap *outside* `ActionMasker`, not inside. Otherwise `get_action_masks()` fails because `Monitor` does not expose `action_masks()`.

 ---

 ## 3. Development Decisions

+ ### 3.1 Single-Agent Wrapper

+ We control only `agent_0`; opponents use random (Phase 1-2) or rule-based (Phase 3) policies. This reduces the problem to single-agent RL in a non-stationary environment.
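A dependency-free sketch of this reduction: step every agent, surface only agent_0's transition. The real wrapper adapts the PettingZoo parallel env to `gymnasium.Env`; the toy env, its reward, and all names below are ours for illustration.

```python
import random

class ToyParallelEnv:
    """Stand-in for the parallel Bomberman env: 2 agents, 6 actions."""
    agents = ["agent_0", "agent_1"]

    def reset(self):
        return {a: {"step": 0} for a in self.agents}

    def step(self, actions):
        obs = {a: {"step": 1} for a in self.agents}
        rew = {a: float(actions[a] == 4) for a in self.agents}  # toy: reward STAY
        done = {a: False for a in self.agents}
        return obs, rew, done

class SingleAgentWrapper:
    """Run the full multi-agent step, return only agent_0's view."""
    def __init__(self, env, opponent=None):
        self.env = env
        self.opponent = opponent or (lambda obs: random.randrange(6))
        self._last = None

    def reset(self):
        self._last = self.env.reset()
        return self._last["agent_0"]

    def step(self, action):
        acts = {"agent_0": action}
        for a in self.env.agents:          # opponents act via their policy
            if a != "agent_0":
                acts[a] = self.opponent(self._last[a])
        obs, rew, done = self.env.step(acts)
        self._last = obs
        return obs["agent_0"], rew["agent_0"], done["agent_0"]

env = SingleAgentWrapper(ToyParallelEnv())
env.reset()
obs, r, d = env.step(4)
assert r == 1.0 and d is False
```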
 
 
+ ### 3.2 3-Phase Curriculum

 | Phase | Opponent | Duration | Purpose |
 |---|---|---|---|
+ | **1** | Random | 500k | Learn movement, bombs, basics |
+ | **2** | Random + exploration bonus | 500k | Prevent camping exploit |
+ | **3** | Rule-based curriculum | 1M | Generalize to structured opponents |

+ ### 3.3 Library Stack

+ - `stable-baselines3` for the PPO core
+ - `sb3-contrib` for MaskablePPO + ActionMasker
+ - `huggingface_hub` for persistent checkpoint storage

+ ### 3.4 Why Push to the Hub Every 50k Steps?

+ Sandbox resets (T4 container recycling) caused local `/app/data/` loss multiple times. Hub checkpointing saved the project when training crashed at 400k steps.
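The dual-save cadence can be sketched with a pure helper that decides when a checkpoint should also be pushed to the Hub. The real version wires this into an SB3 callback calling `HfApi.upload_file()`; `should_push` and the values below are ours.

```python
def should_push(num_timesteps: int, last_push: int, every: int = 50_000) -> bool:
    """Push whenever another `every` steps have elapsed since the last push."""
    return num_timesteps - last_push >= every

# With 2048-step rollouts, pushes land on the first rollout boundary past
# each 50k mark.
pushed_at, uploads = 0, []
for step in range(0, 200_001, 2_048):
    if should_push(step, pushed_at):
        uploads.append(step)
        pushed_at = step
assert uploads == [51_200, 102_400, 153_600]
```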

+ ---

+ ## 4. Training Phases

+ ### 4.1 Phase 1: Foundation (vs Random)

+ **Duration**: 500,352 steps
+ **Result**: 92% win rate, 180.1 avg reward, 100% survival
+ **Challenges**: Wrapper ordering, missing dependencies, sandbox resets

+ ### 4.2 Phase 2: Exploration Shaping (IN PROGRESS)

+ **Status**: Started at 500,352 steps, running on A10G at ~54 FPS
+ **Mechanism**: Visit-count bonus `1/(1+visits)`, annealed by `1 - tanh(k * avg_enemy_deaths)`
+ **ETA**: ~2.5 hours, targeting 1,000,352 total steps
+ **Purpose**: Force map exploration, prevent safe base-camping
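The shaping rule above can be written as a standalone function. The formula follows the Phase 2 mechanism and `k=1.2` matches the interim table; `base_weight`, the 0.1 floor, and the helper name are illustrative.

```python
import math

def exploration_bonus(visit_counts, x, y, avg_enemy_deaths,
                      base_weight=1.0, k=1.2):
    """Visit-count bonus, annealed away as the agent learns to kill."""
    visit_bonus = 1.0 / (1.0 + visit_counts[(x, y)])
    alpha = 1.0 - math.tanh(k * avg_enemy_deaths)   # fades as kills rise
    explore_weight = base_weight * max(0.1, alpha)  # floor keeps some bonus
    return explore_weight * visit_bonus

visits = {(3, 4): 0}
early = exploration_bonus(visits, 3, 4, avg_enemy_deaths=0.0)  # full bonus
late = exploration_bonus(visits, 3, 4, avg_enemy_deaths=5.0)   # annealed
assert late < early
```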

+ ### 4.3 Phase 3: Curriculum Self-Play

+ **Pending**: Rule-based static → simple → smart → mixed, 3 teams, 1M steps
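The stage-advancement gate ("win rate ≥ 55% over 500 episodes, or max 500 episodes per stage") can be sketched as below. The rolling window of 100 episodes for the early-advance check is our guess; the source only states the threshold and the cap.

```python
from collections import deque

class CurriculumGate:
    """Advance a stage when the rolling win rate clears the threshold,
    or unconditionally once the per-stage episode cap is reached."""
    def __init__(self, threshold=0.55, window=100, cap=500):
        self.threshold, self.cap = threshold, cap
        self.recent = deque(maxlen=window)
        self.episodes = 0

    def record(self, won: bool) -> bool:
        """Record one episode result; True means move to the next stage."""
        self.episodes += 1
        self.recent.append(won)
        if self.episodes >= self.cap:
            return True                      # hard cap per stage
        full = len(self.recent) == self.recent.maxlen
        return full and sum(self.recent) / len(self.recent) >= self.threshold

gate = CurriculumGate()
assert any(gate.record(True) for _ in range(100))  # straight wins advance early
```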

 ---

 | Metric | Value |
 |---|---|
+ | Timesteps | 500,352 |
+ | Final Reward | 237.0 |
+ | FPS | 52 (A10G) |
+ | Wall time | ~2h 15min |
+ | Win Rate (eval) | **92.0%** |
+ | Avg Reward (eval) | **180.1** |
+ | Survival Rate | **100.0%** |

+ ### 5.2 Phase 2 Interim (Early)

 | Metric | Value |
 |---|---|
+ | Starting Step | 500,352 |
+ | Initial Reward (shaped) | 210 |
+ | FPS | 54 |
+ | Explore Weight | Adaptive, k=1.2 |

 ---

 ## 6. Artifacts

 | File | Purpose |
 |---|---|
+ | `phase1_final.zip` | Trained model |
+ | `phase2_final.zip` | *(in progress)* |
+ | `ckpt_50000-400000.zip` | Phase 1 intermediates |
+ | `ae_manager.py` | Inference code |
+ | `docs/ae.md` | This documentation |

 ---

 ## 7. Next Steps

+ - **Phase 2**: Complete 500k exploration-shaping steps
+ - **Phase 3**: Curriculum vs rule-based opponents (1M steps)
+ - **Eval**: Multi-team evaluation vs smart opponents
+ - **Future**: CNN policy, opponent modeling, LSTM memory

+ *Last updated: 2026-05-14 — Phase 2 in progress*