Update docs: Phase 2 started, add interim results

docs/ae.md (+64 −260)

### 1.1 Domain: Multi-Agent Bomberman RL

The TIL-26-AE challenge is a multi-agent Bomberman-like environment where agents navigate a grid, collect resources, place bombs, destroy walls, and eliminate opponents. The key challenge is **autonomous exploration**.

### 1.2 Key Papers

| Paper | arXiv ID | Key Insight | Relevance |
|---|---|---|---|
| *Pommerman: A Multi-Agent Benchmark* | 2407.00662 | PettingZoo + parallel env standard | Confirmed approach |
| *MAPPO* | 2103.01955 | Shared parameters, curriculum | Justified curriculum |
| *Invalid Action Masking* | 2006.14171 | Masks logits before softmax | **Directly applicable** |
| *PPO Algorithms* | 1707.06347 | Clipped surrogate, stable | Chosen over DQN |

### 1.3 Why MaskablePPO?

- Bomberman agents cannot move into walls, out of bounds, or place bombs without a stockpile
- The observation includes `action_mask: uint8[6]` — a binary legal-action indicator
- Standard PPO would waste ~30-40% of samples on illegal moves early in training
- `sb3-contrib`'s `MaskablePPO` masks logits before softmax, ensuring only legal actions are sampled

**Decision**: Use `sb3-contrib`'s `MaskablePPO` with `ActionMasker` wrapper.
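
To make the masking mechanism concrete, here is a minimal NumPy illustration of the idea (an explanatory sketch, not the library's internals; the logit and mask values are made up):

```python
import numpy as np

# Hypothetical policy logits for the 6 discrete actions.
logits = np.array([1.2, 0.3, -0.5, 2.0, 0.1, 0.7])
# Hypothetical legality mask: actions 2 and 5 are illegal this step.
mask = np.array([1, 1, 0, 1, 1, 0], dtype=bool)

# Masking sets illegal logits to -inf, so softmax assigns them probability 0.
masked = np.where(mask, logits, -np.inf)
probs = np.exp(masked - masked.max())
probs /= probs.sum()
print(probs.round(3))  # illegal actions get exactly 0 probability
```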

### 1.4 Why Curriculum Learning?

From `arxiv:2103.01955` (MAPPO) and Pommerman benchmarks, we learned:

- Training against strong opponents from scratch leads to **catastrophic early losses** (~0 reward)
- Curriculum learning (easy → hard) is standard practice in competitive multi-agent RL
- Rule-based opponents at increasing difficulty provide stable reward signals during learning

### 1.5 Why Not DQN?

- DQN struggles with action masking (requires custom architecture)
- PPO's on-policy updates handle the non-stationarity of multi-agent self-play better
- PPO is simpler to tune and has mature invalid-action-masking support in `sb3-contrib`

---

## 2. Environment

### 2.1 Environment Structure

- **Grid size**: 16×16 (confirmed from `default_config()`)
- **Agents**: Configurable (default 2 teams, Phase 3 uses 3)
- **Observations**: Dict with
  - local spatial views (`agent_viewcone`: 7×5×25, `base_viewcone`: 5×5×25)
  - `direction`: Discrete(4) — facing
  - `location`, `base_location`, `health`, `frozen_ticks`, `base_health`, `team_resources`, `team_bombs`, `step`
  - `action_mask`: uint8[6] — binary legality mask
- **Actions**: Discrete(6)
  - 0 = FORWARD, 1 = BACKWARD, 2 = LEFT, 3 = RIGHT, 4 = STAY, 5 = PLACE_BOMB
- **Episode length**: ~200 steps (observed during training)

### 2.2 Observation Flattening

The dict observation is flattened to a **1511-dim vector**:

```
agent_viewcone: 7 × 5 × 25 = 875
base_viewcone:  5 × 5 × 25 = 625
direction:      1
location:       2
base_location:  2
health:         1
frozen_ticks:   1
base_health:    1
team_resources: 1
team_bombs:     1
step:           1
─────────────────────────────
TOTAL:          1511
```
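
A minimal sketch of that flattening, assuming the field names and shapes listed above (the hypothetical `flatten_obs` helper is illustrative; the actual training code may differ):

```python
import numpy as np

# Scalar-ish fields, in the order of the breakdown above.
FIELDS = ["direction", "location", "base_location", "health", "frozen_ticks",
          "base_health", "team_resources", "team_bombs", "step"]

def flatten_obs(obs: dict) -> np.ndarray:
    parts = [
        np.asarray(obs["agent_viewcone"], dtype=np.float32).ravel(),  # 875
        np.asarray(obs["base_viewcone"], dtype=np.float32).ravel(),   # 625
    ]
    parts += [np.asarray(obs[k], dtype=np.float32).ravel()
              for k in FIELDS]                                        # 11 scalars
    vec = np.concatenate(parts)
    assert vec.shape == (1511,)
    return vec
```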

### 2.3 Action Masking Implementation

```python
env = ActionMasker(base_env, lambda e: e.action_masks())
```

The wrapper exposes `action_masks()`, which returns a bool[6] array. `MaskablePPO` uses this internally via `sb3_contrib`'s `get_action_masks()` during rollout collection.

**Critical bug found**: `Monitor` must wrap *outside* `ActionMasker`, not inside. Otherwise `get_action_masks()` fails because `Monitor` does not expose `action_masks()`. We fixed this ordering issue during development.
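
For clarity, the correct ordering looks like this (a minimal sketch, assuming a gymnasium-compatible `base_env` that implements `action_masks()`):

```python
from sb3_contrib.common.wrappers import ActionMasker
from stable_baselines3.common.monitor import Monitor

# ActionMasker first, so the mask function sees the raw env...
env = ActionMasker(base_env, lambda e: e.action_masks())
# ...then Monitor on the outside. get_action_masks() still reaches
# action_masks() because gymnasium wrappers forward attribute lookups.
env = Monitor(env)
```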

---

## 3. Development Decisions

### 3.1 Single-Agent Wrapper

The TIL environment is inherently multi-agent (PettingZoo AEC). However, for the AE challenge, we only control **agent_0**; opponents use fixed policies. We wrapped the parallel PettingZoo env into a `gymnasium.Env` that:

- Returns only agent_0's observation/reward/done
- Uses random valid actions for opponents (Phase 1-2) or rule-based policies (Phase 3)

This reduces the problem to single-agent RL with a non-stationary environment.
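
A minimal sketch of such a wrapper, assuming a parallel PettingZoo env and a hypothetical `opponent_policy` callable (names are illustrative, not the project's actual code):

```python
import gymnasium as gym

class SingleAgentWrapper(gym.Env):
    """Expose agent_0 of a parallel PettingZoo env as a gymnasium.Env."""

    def __init__(self, parallel_env, opponent_policy):
        self.env = parallel_env
        self.opponent_policy = opponent_policy  # maps (agent_id, obs) -> action
        self.observation_space = parallel_env.observation_space("agent_0")
        self.action_space = parallel_env.action_space("agent_0")

    def reset(self, *, seed=None, options=None):
        obs, infos = self.env.reset(seed=seed)
        self._last_obs = obs
        return obs["agent_0"], infos.get("agent_0", {})

    def step(self, action):
        # Opponents act from their own last observations; we act from ours.
        actions = {aid: self.opponent_policy(aid, self._last_obs[aid])
                   for aid in self.env.agents if aid != "agent_0"}
        actions["agent_0"] = action
        obs, rewards, terms, truncs, infos = self.env.step(actions)
        self._last_obs = obs
        return (obs.get("agent_0"), rewards.get("agent_0", 0.0),
                terms.get("agent_0", True), truncs.get("agent_0", False),
                infos.get("agent_0", {}))
```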

### 3.2 Why 3-Phase Curriculum?

| Phase | Opponent | Duration | Purpose |
|---|---|---|---|
| **1** | Random | 500k | Learn movement, bombs, basics |
| **2** | Random + exploration bonus | 500k | Prevent camping exploit |
| **3** | Rule-based curriculum | 1M | Generalize to structured opponents |

**Phase 1 vs Random** gives the agent a chance to learn fundamentals without being immediately killed by competent opponents. Random opponents still place bombs and move, providing exposure to explosion mechanics.

**Phase 2 Exploration Shaping** addresses a known issue: agents learn to survive by staying near their base and waiting for random opponents to walk into bombs. The visit-count bonus (`1/(1+visits)`) forces the agent to explore new tiles.

### 3.3 Library Stack

| Library | Purpose |
|---|---|
| `stable-baselines3` | Core PPO implementation, callbacks, Monitor, checkpoints |
| `sb3-contrib` | `MaskablePPO`, `ActionMasker`, invalid-action masking utilities |
| `gymnasium` | Env API (observation/action spaces, step/reset) |
| `pettingzoo` | Multi-agent env conversion (`aec_to_parallel`) |
| `huggingface_hub` | Push checkpoints to persistent storage |

### 3.4 Why Push to the Hub Every 50k Steps?

Sandbox resets (T4 container recycling) caused local `/app/data/` loss multiple times; Hub checkpointing saved the project at 400k steps when training crashed. Two mechanisms run side by side:

- Local: `CheckpointCallback(save_freq=50000)`
- Hub: Custom callback calling `HfApi.upload_file()` every 50k steps
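
A minimal sketch of such a Hub callback, assuming a repo id and auth token are already configured (class and paths are illustrative, not the project's exact code):

```python
from huggingface_hub import HfApi
from stable_baselines3.common.callbacks import BaseCallback

class HubCheckpointCallback(BaseCallback):
    """Upload the latest checkpoint to the Hub every `upload_freq` steps."""

    def __init__(self, repo_id: str, upload_freq: int = 50_000):
        super().__init__()
        self.api = HfApi()
        self.repo_id = repo_id
        self.upload_freq = upload_freq

    def _on_step(self) -> bool:
        if self.num_timesteps % self.upload_freq == 0:
            path = f"/tmp/ckpt_{self.num_timesteps}.zip"
            self.model.save(path)  # serialize the current policy
            self.api.upload_file(path_or_fileobj=path,
                                 path_in_repo=f"ckpt_{self.num_timesteps}.zip",
                                 repo_id=self.repo_id)
        return True  # never interrupt training
```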

### 3.5 Throughput

| GPU | FPS | Time per 100k steps |
|---|---|---|
| A10G | ~52 | ~32 min |

---

## 4. Training Phases

### 4.1 Phase 1: Foundation (vs Random)

**Duration**: 500,000 steps
**Opponent**: Random valid actions
**Environment**: 2 teams, agent_0 vs random
**Hyperparameters**:
```python
model = MaskablePPO(
    "MlpPolicy", env,
    learning_rate=3e-4,
    n_steps=2048,
    batch_size=64,
    n_epochs=10,
    gamma=0.99,
    gae_lambda=0.95,
    clip_range=0.2,
    ent_coef=0.01,  # encourage exploration early
    vf_coef=0.5,
    max_grad_norm=0.5,
    device="cuda",
)
```

**Purpose**: Learn basic survival, map layout, bomb mechanics, and opponent interaction. Random opponents provide a low-stakes environment where the agent can experiment without being immediately eliminated.

**Result**: 92% win rate, 180.1 average eval reward, 100% survival (see §5.1).

**Challenges encountered**:

- Initial wrapper ordering bug (`Monitor` inside `ActionMasker`)
- Missing dependencies (`omegaconf`, `perlin_noise`) in fresh sandboxes
- Sandbox resets — resolved by Hub checkpointing

### 4.2 Phase 2: Exploration Shaping with Adaptive Annealing (IN PROGRESS)

**Duration**: 500,000 steps (target: 1,000,352 total)
**Status**: Started at 500,352 steps, running on an A10G at ~54 FPS; ETA ~2.5 hours
**Opponent**: Random valid actions
**Environment**: 2 teams + exploration bonus
**Mechanism**:
```python
# Visit-count bonus: first visit to a tile earns 1.0, decaying with revisits.
visit_bonus = 1.0 / (1.0 + visit_counts[x, y])

# Adaptive annealing: as average enemy kills rise, alpha -> 0 and the bonus
# fades (tanh is e.g. numpy's np.tanh; k is the annealing rate, 1.2 here).
alpha = 1.0 - tanh(k * avg_enemy_deaths)
explore_weight = base_weight * max(0.1, alpha)
```

As the agent gets better at killing enemies, the exploration bonus fades, shifting focus toward combat optimization.

**Purpose**: Prevent the "camping" exploit where agents hide near their base and wait. Force proactive map exploration and resource collection.
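
A minimal sketch of how the bonus could be applied as a reward-shaping wrapper (illustrative only; `get_agent_xy`, the `base_weight` default, and the death-count wiring are assumptions):

```python
import gymnasium as gym
import numpy as np

class ExplorationBonusWrapper(gym.Wrapper):
    """Add a visit-count exploration bonus to the env reward."""

    def __init__(self, env, grid_size=16, base_weight=0.5, k=1.2):
        super().__init__(env)
        self.visit_counts = np.zeros((grid_size, grid_size))
        self.base_weight, self.k = base_weight, k
        self.avg_enemy_deaths = 0.0  # updated from episode stats elsewhere

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        x, y = get_agent_xy(obs)  # hypothetical: extract tile from observation
        bonus = 1.0 / (1.0 + self.visit_counts[x, y])
        self.visit_counts[x, y] += 1
        alpha = 1.0 - np.tanh(self.k * self.avg_enemy_deaths)
        reward += self.base_weight * max(0.1, alpha) * bonus
        return obs, reward, terminated, truncated, info
```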

### 4.3 Phase 3: Curriculum Self-Play (Rule-Based Opponents)

**Status**: Pending
**Duration**: 1,000,000 steps
**Opponent**: Rule-based with curriculum difficulty
**Environment**: 3 teams
**Curriculum stages**:

1. **Static**: Opponents do nothing (STAY)
2. **Simple**: Bomb when an enemy is in the viewcone, otherwise move randomly
3. **Smart**: Score-based action selection (collectibles + wall avoidance)
4. **Mixed**: Half smart, half simple

**Advancement condition**: Win rate ≥ 55% over 500 episodes, or max 500 episodes per stage (see the sketch below).

**Purpose**: Generalize to structured, multi-team competition. The curriculum ensures the agent doesn't face a difficulty cliff when switching from random to competent opponents.
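
A minimal sketch of the stage-advancement logic under one reasonable reading of that condition (class name, the 50-episode warm-up, and the bookkeeping are assumptions):

```python
from collections import deque

class CurriculumTracker:
    """Advance opponent difficulty on sustained win rate (or an episode cap)."""

    def __init__(self, stages, win_threshold=0.55, window=500, max_episodes=500):
        self.stages = stages            # e.g. ["static", "simple", "smart", "mixed"]
        self.stage_idx = 0
        self.win_threshold = win_threshold
        self.results = deque(maxlen=window)   # rolling win/loss record
        self.max_episodes = max_episodes
        self.episodes_in_stage = 0

    def record_episode(self, won: bool) -> str:
        self.results.append(won)
        self.episodes_in_stage += 1
        win_rate = sum(self.results) / len(self.results)
        # Advance on threshold (after a 50-episode warm-up, an assumption)
        # or when the per-stage episode cap is reached.
        if ((win_rate >= self.win_threshold and len(self.results) >= 50)
                or self.episodes_in_stage >= self.max_episodes):
            if self.stage_idx < len(self.stages) - 1:
                self.stage_idx += 1
            self.results.clear()
            self.episodes_in_stage = 0
        return self.stages[self.stage_idx]
```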

---

## 5. Results

### 5.1 Phase 1 Final Metrics

| Metric | Value |
|---|---|
| Timesteps | 500,352 |
| Final Reward | 237.0 |
| FPS | 52 (A10G) |
| Wall time | ~2h 15min |
| Win Rate (eval) | **92.0%** |
| Avg Reward (eval) | **180.1** |
| Survival Rate | **100.0%** |

**Interpretation**: The agent has mastered the basics against random opponents. It consistently survives full episodes, places bombs frequently, and wins nearly every match. The gap between training reward (237) and eval reward (180) suggests some reward shaping during training that doesn't transfer to deterministic eval.

### 5.2 Phase 2 Interim (Early)

| Metric | Value |
|---|---|
| Starting Step | 500,352 |
| Initial Reward (shaped) | 210 |
| FPS | 54 |
| Explore Weight | Adaptive, k=1.2 |

### 5.3 Training Reward Trajectory (Phase 1)

| Steps | Episode Reward | Notes |
|---|---|---|
| 2,048 | 41.4 | Initial random policy |
| 20,480 | ~104 | Learning movement |
| 53,248 | ~116 | First checkpoint |
| 110,592 | ~159 | Consistent improvement |
| 204,096 | ~219 | Strong policy emerging |
| 306,496 | ~203 | Slight dip (exploration) |
| 416,384 | ~224 | Convergence |
| 500,352 | 237.0 | Final |

---

## 6. Artifacts

### 6.1 Model Repo (`E-Rong/til-26-ae-agent`)

| File | Purpose |
|---|---|
| `phase1_final.zip` | Phase 1 trained model (500k steps) |
| `phase2_final.zip` | Phase 2 model *(in progress)* |
| `ckpt_50000.zip` – `ckpt_400000.zip` | Intermediate checkpoints |
| `ae_manager.py` | Inference code for AE server |
| `phase1_eval_results.txt` | Raw evaluation numbers |
| `phase1_summary.txt` | This summary (abridged) |
| `train_all_phases.py` | Full training script |
| `train_in_space.py` | Space-compatible training script |
| `requirements.txt` | Python dependencies |
| `docs/ae.md` | This documentation |

### 6.2 Space Integration (`e-rong/til-26-ae`)

**Inference flow**:

1. AE server receives `POST /ae` with observation dict
2. `AEManager.ae(observation)` flattens obs → 1511-dim vector
3. Loads `MaskablePPO` from `phase1_final.zip` (cached at `/workspace/models/`)
4. Calls `model.predict(obs_vec, action_masks=mask, deterministic=True)`
5. Returns action int in [0, 5]

**Fallback**: If no model is found, returns a random valid action.
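
A condensed sketch of that flow, including the fallback (assuming the `flatten_obs` helper sketched in §2.2; the real `ae_manager.py` may differ):

```python
import numpy as np
from sb3_contrib import MaskablePPO

class AEManager:
    """Illustrative inference manager for the AE server."""

    def __init__(self, model_path="/workspace/models/phase1_final.zip"):
        try:
            self.model = MaskablePPO.load(model_path, device="cpu")
        except FileNotFoundError:
            self.model = None  # fall back to random valid actions

    def ae(self, observation: dict) -> int:
        mask = np.asarray(observation["action_mask"], dtype=bool)
        if self.model is None:
            return int(np.random.choice(np.flatnonzero(mask)))
        obs_vec = flatten_obs(observation)  # 1511-dim vector, see §2.2
        action, _ = self.model.predict(obs_vec, action_masks=mask,
                                       deterministic=True)
        return int(action)
```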

---

## 7. Next Steps

### 7.1 Phase 2: Exploration Shaping (In Progress)

- Complete 500k exploration-shaping steps vs random (to 1,000,352 total)
- Expected: Higher map coverage, less base-camping, similar win rate

### 7.2 Phase 3: Curriculum Self-Play (Pending)

- Load Phase 2 final model
- Configure 3-team environment
- Progress through rule-based opponent difficulty
- Expected: Win rate drops initially, then recovers as the curriculum advances

### 7.3 Evaluation Against Non-Random Opponents (Pending)

- Evaluate Phase 3 model vs rule-based "smart" opponents
- Target: > 50% win rate against smart opponents
- Multi-team evaluation (3-way matches)

### 7.4 Known Limitations

- **No recurrent policy**: The MLP policy has no memory of past observations. It may struggle with bomb fuse timing or opponent tracking.
- **No opponent modeling**: The policy treats opponent actions as environment noise. It could benefit from opponent ID or history encoding.
- **Flattened observations**: Dict observations with spatial structure (viewcones) are flattened into vectors. A CNN policy might exploit spatial patterns better.
- **Deterministic eval**: Currently uses `deterministic=True` for evaluation. Stochastic evaluation might reveal policy variance.

### 7.5 Future Improvements

1. **CNN policy**: Use `CnnPolicy` with 2D viewcone inputs instead of MLP
2. **LSTM/GRU**: Add memory for temporal opponent tracking
3. **Proper self-play**: Train both teams simultaneously with shared or separate policy networks
4. **Population-based training**: Train a population of agents and evaluate them against each other
5. **Reward decomposition**: Separate rewards for movement, bombs, kills, survival, and resource collection

---

*Last updated: 2026-05-14*
*Current phase: Phase 1 complete, Phase 2 in progress*
*Author: ML Intern (E-Rong)*