E-Rong committed (verified) · Commit a00144d · 1 Parent(s): 5c6cad0

Update docs: Phase 2 started, add interim results

Files changed (1):
  1. docs/ae.md +64 -260

docs/ae.md CHANGED
@@ -23,43 +23,28 @@

 ### 1.1 Domain: Multi-Agent Bomberman RL

- The TIL-26-AE challenge is a multi-agent Bomberman-like environment where agents navigate a grid, collect resources, place bombs, destroy walls, and eliminate opponents. The key challenge is **autonomous exploration** — agents must learn to navigate, compete, and survive without hand-crafted heuristics.

- ### 1.2 Key Papers Consulted

 | Paper | arXiv ID | Key Insight | Relevance |
 |---|---|---|---|
- | *Pommerman: A Multi-Agent Benchmark* | 2407.00662 | Multi-agent competitive environment similar to Bomberman; MAPPO baseline performance | Confirmed PettingZoo + parallel env as standard approach |
- | *The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games* | 2103.01955 | MAPPO with shared parameters, role-specific conditioning | Justified single-agent wrapper with self-play curriculum |
- | *A Closer Look at Invalid Action Masking in Policy Gradient Algorithms* | 2006.14171 | Invalid action masking dramatically improves sample efficiency in discrete action spaces with legal-action constraints | **Directly applicable** — Bomberman has wall/edge constraints |
- | *Proximal Policy Optimization Algorithms* | 1707.06347 | PPO with clipped surrogate objective, stable and scalable | Chosen over DQN for continuous policy updates and easier masking integration |

 ### 1.3 Why MaskablePPO?

- After reading `arxiv:2006.14171`, we identified that **invalid action masking** is critical for this domain:
-
- - Bomberman agents cannot move into walls, out of bounds, or place bombs without a stockpile
- - The observation includes `action_mask: uint8[6]` — a binary legal-action indicator
- - Standard PPO would waste ~30-40% of samples on illegal moves early in training
- - `sb3-contrib`'s `MaskablePPO` masks logits before softmax, ensuring only legal actions are sampled
-
- **Decision**: Use `sb3-contrib`'s `MaskablePPO` with the `ActionMasker` wrapper.

 ### 1.4 Why Curriculum Learning?

- From `arxiv:2103.01955` (MAPPO) and Pommerman benchmarks, we learned:
-
- - Training against strong opponents from scratch leads to **catastrophic early losses** (~0 reward)
- - Curriculum learning (easy → hard) is standard practice in competitive multi-agent RL
- - Rule-based opponents at increasing difficulty provide stable reward signals during learning
-
- **Decision**: Implement a 3-phase curriculum with adaptive difficulty gating.

- ### 1.5 Why Not DQN / Rainbow?
-
- - DQN struggles with action masking (it requires a custom architecture)
- - PPO's on-policy updates handle the non-stationarity of multi-agent self-play better
- - PPO is simpler to tune and has mature invalid-action-masking support in `sb3-contrib`

 ---
 
@@ -67,176 +52,66 @@ From `arxiv:2103.01955` (MAPPO) and Pommerman benchmarks, we learned:

 ### 2.1 Environment Structure

- The `til-26-ae` environment (`e-rong/til-26-ae` Space) is a PettingZoo-style AEC (Agent Environment Cycle) multi-agent game:
-
- - **Grid size**: 16×16 (confirmed from `default_config()`)
 - **Agents**: Configurable (default 2 teams, Phase 3 uses 3)
- - **Observations**: Dict with:
-   - `agent_viewcone`: float32 [7×5×25] — agent-facing view
-   - `base_viewcone`: float32 [5×5×25] — base-centered view
-   - `direction`: Discrete(4) — facing
-   - `location`, `base_location`, `health`, `frozen_ticks`, `base_health`, `team_resources`, `team_bombs`, `step`
-   - `action_mask`: uint8[6] — binary legality mask
- - **Actions**: Discrete(6)
-   - 0 = FORWARD, 1 = BACKWARD, 2 = LEFT, 3 = RIGHT, 4 = STAY, 5 = PLACE_BOMB
- - **Episode length**: ~200 steps (observed during training)

 ### 2.2 Observation Flattening

- We flatten the dict observation into a **1511-dim vector**:
-
- ```
- agent_viewcone: 7 × 5 × 25 = 875
- base_viewcone:  5 × 5 × 25 = 625
- direction:       1
- location:        2
- base_location:   2
- health:          1
- frozen_ticks:    1
- base_health:     1
- team_resources:  1
- team_bombs:      1
- step:            1
- ─────────────────────────────
- TOTAL:        1511
- ```
-
- This matches the MLP policy input in `MaskablePPO("MlpPolicy", ...)`.
-
- ### 2.3 Action Masking Implementation
-
- ```python
- env = ActionMasker(base_env, lambda e: e.action_masks())
- ```
-
- The wrapper exposes `action_masks()`, which returns a bool[6] array. `MaskablePPO` uses this internally via `sb3_contrib`'s `get_action_masks()` during rollout collection.
-
- **Critical bug found**: `Monitor` must wrap *outside* `ActionMasker`, not inside. Otherwise `get_action_masks()` fails because `Monitor` does not expose `action_masks()`. We fixed this ordering issue during development.

 ---

 ## 3. Development Decisions

- ### 3.1 Why a Single-Agent Wrapper?
-
- The TIL environment is inherently multi-agent (PettingZoo AEC). However, for the AE challenge, we only control **agent_0**; opponents use fixed policies. We wrapped the parallel PettingZoo env into a `gymnasium.Env` that:
-
- - Runs the full multi-agent step
- - Returns only agent_0's observation/reward/done
- - Uses random valid actions for opponents (Phase 1-2) or rule-based policies (Phase 3)
-
- This reduces the problem to single-agent RL in a non-stationary environment (opponent policies change between phases).
-
- ### 3.2 Why a 3-Phase Curriculum?

 | Phase | Opponent | Duration | Purpose |
 |---|---|---|---|
- | **1** | Random valid actions | 500k steps | Learn basic movement, bomb mechanics, map navigation |
- | **2** | Random + exploration shaping | 500k steps | Prevent "camping" exploit; encourage full map coverage |
- | **3** | Rule-based (curriculum) | 1M steps | Generalize to structured opponents; scale to multi-team |
-
- **Phase 1 vs Random** gives the agent a chance to learn fundamentals without being immediately killed by competent opponents. Random opponents still place bombs and move, providing exposure to explosion mechanics.
-
- **Phase 2 Exploration Shaping** addresses a known issue: agents learn to survive by staying near their base and waiting for random opponents to walk into bombs. The visit-count bonus (`1/(1+visits)`) forces the agent to explore new tiles.
-
- **Phase 3 Curriculum** transitions from random to structured opponents using a difficulty ladder: static → simple → smart → mixed. This mirrors how humans learn and prevents the "forgetting" problem when suddenly switching opponent types.
-
- ### 3.3 Why Stable-Baselines3 + sb3-contrib?
-
- | Library | Role |
- |---|---|
- | `stable-baselines3` | Core PPO implementation, callbacks, Monitor, checkpoints |
- | `sb3-contrib` | `MaskablePPO`, `ActionMasker`, invalid-action masking utilities |
- | `gymnasium` | Env API (observation/action spaces, step/reset) |
- | `pettingzoo` | Multi-agent env conversion (`aec_to_parallel`) |
- | `huggingface_hub` | Push checkpoints to persistent storage |
-
- ### 3.4 Why Push Checkpoints to the Hub Every 50k Steps?
-
- During development, we encountered **sandbox resets** (the T4 container was recycled unexpectedly). Local `/app/data/` was lost, but the Hub model repo (`E-Rong/til-26-ae-agent`) persisted.
-
- **Decision**: Implement a dual-save strategy:
- - Local: `CheckpointCallback(save_freq=50000)`
- - Hub: Custom callback calling `HfApi.upload_file()` every 50k steps
-
- This saved the project when the sandbox reset at 400k steps — we resumed from `ckpt_400000.zip` on the Hub without losing progress.
-
- ### 3.5 Why A10G over T4?
-
- | Hardware | FPS | Time for 100k steps |
- |---|---|---|
- | T4 | ~42 | ~40 min |
- | A10G | ~52 | ~32 min |
-
- A10G provided more stable performance and 24GB of VRAM (vs the T4's 16GB). Given the ~2M total steps across 3 phases, A10G saves ~2 hours total.
-
- ---
-
- ## 4. Training Phases
-
- ### 4.1 Phase 1: Foundation (MaskablePPO vs Random)
-
- **Duration**: 500,000 steps
- **Opponent**: Random valid actions
- **Environment**: 2 teams, agent_0 vs random
- **Hyperparameters**:
- ```python
- MaskablePPO(
-     "MlpPolicy", env,
-     learning_rate=3e-4,
-     n_steps=2048,
-     batch_size=64,
-     n_epochs=10,
-     gamma=0.99,
-     gae_lambda=0.95,
-     clip_range=0.2,
-     ent_coef=0.01,  # Encourage exploration early
-     vf_coef=0.5,
-     max_grad_norm=0.5,
-     device="cuda",
- )
- ```
-
- **Purpose**: Learn basic survival, map layout, bomb mechanics, and opponent interaction. Random opponents provide a low-stakes environment where the agent can experiment without being immediately eliminated.
-
- **Challenges encountered**:
- - Initial wrapper ordering bug (`Monitor` inside `ActionMasker`)
- - Missing dependencies (`omegaconf`, `perlin_noise`) in fresh sandboxes
- - Sandbox resets — resolved by Hub checkpointing
-
- ### 4.2 Phase 2: Exploration Shaping (Adaptive Annealing)
-
- **Duration**: 500,000 steps
- **Opponent**: Random valid actions
- **Environment**: 2 teams + exploration bonus
- **Mechanism**:
- ```python
- # Visit-count bonus
- visit_bonus = 1.0 / (1.0 + visit_counts[x, y])
-
- # Adaptive annealing
- alpha = 1.0 - tanh(k * avg_enemy_deaths)
- explore_weight = base_weight * max(0.1, alpha)
- ```
-
- As the agent gets better at killing enemies, the exploration bonus fades, shifting focus toward combat optimization.
-
- **Purpose**: Prevent the "camping" exploit where agents hide near their base and wait. Force proactive map exploration and resource collection.
-
- ### 4.3 Phase 3: Curriculum Self-Play (Rule-Based Opponents)
-
- **Duration**: 1,000,000 steps
- **Opponent**: Rule-based with curriculum difficulty
- **Environment**: 3 teams
- **Curriculum stages**:
- 1. **Static**: Opponents do nothing (STAY)
- 2. **Simple**: Bomb when an enemy is in the viewcone, otherwise move randomly
- 3. **Smart**: Score-based action selection (collectibles + wall avoidance)
- 4. **Mixed**: Half smart, half simple
-
- **Advancement condition**: Win rate ≥ 55% over 500 episodes, or max 500 episodes per stage.
-
- **Purpose**: Generalize to structured, multi-team competition. The curriculum ensures the agent doesn't face a difficulty cliff when switching from random to competent opponents.

 ---
 
@@ -246,113 +121,42 @@ As the agent gets better at killing enemies, the exploration bonus fades, shifting focus toward combat optimization.

 | Metric | Value |
 |---|---|
- | **Timesteps** | 500,352 |
- | **Final Training Reward** | 237.0 |
- | **FPS** | 52 (A10G) |
- | **Total wall time** | ~2h 15min |
- | **Checkpoints** | ckpt_50000 through ckpt_400000 (every 50k) + phase1_final |

- ### 5.2 Phase 1 Evaluation (100 Episodes vs Random Opponents)

 | Metric | Value |
 |---|---|
- | **Win Rate** | **92.0%** (92/100) |
- | **Average Reward** | **180.1** |
- | **Average Episode Length** | 200.0 steps |
- | **Average Bombs/Episode** | 20.4 |
- | **Survival Rate** | **100.0%** |
-
- **Interpretation**: The agent has mastered the basics against random opponents. It consistently survives full episodes, places bombs frequently, and wins nearly every match. The gap between the training reward (237) and the eval reward (180) suggests reward shaping (e.g., an exploration bonus) during training that doesn't transfer to deterministic eval.
-
- ### 5.3 Training Reward Trajectory (Phase 1)
-
- | Steps | Episode Reward | Notes |
- |---|---|---|
- | 2,048 | 41.4 | Initial random policy |
- | 20,480 | ~104 | Learning movement |
- | 53,248 | ~116 | First checkpoint |
- | 110,592 | ~159 | Consistent improvement |
- | 204,096 | ~219 | Strong policy emerging |
- | 306,496 | ~203 | Slight dip (exploration) |
- | 416,384 | ~224 | Convergence |
- | 500,352 | 237.0 | Final |

 ---

 ## 6. Artifacts

- ### 6.1 Model Repo (`E-Rong/til-26-ae-agent`)
-
- | File | Purpose |
- |---|---|
- | `phase1_final.zip` | Phase 1 trained model (500k steps) |
- | `ckpt_50000.zip` – `ckpt_400000.zip` | Intermediate checkpoints |
- | `ae_manager.py` | Inference code for the AE server |
- | `phase1_eval_results.txt` | Raw evaluation numbers |
- | `phase1_summary.txt` | This summary (abridged) |
- | `train_all_phases.py` | Full training script |
- | `train_in_space.py` | Space-compatible training script |
- | `requirements.txt` | Python dependencies |
-
- ### 6.2 Space Integration (`e-rong/til-26-ae`)
-
 | File | Purpose |
 |---|---|
- | `ae/src/ae_manager.py` | Loads `phase1_final.zip` from the Hub, serves actions via the `/ae` endpoint |
- | `ae/requirements.txt` | `sb3-contrib`, `torch`, `huggingface_hub` |
- | `ae/Dockerfile` | Standard Python 3.11 image, CPU-only for fast eval startup |
-
- ### 6.3 How Inference Works
-
- 1. The AE server receives `POST /ae` with an observation dict
- 2. `AEManager.ae(observation)` flattens the obs into a 1511-dim vector
- 3. Loads `MaskablePPO` from `phase1_final.zip` (cached at `/workspace/models/`)
- 4. Calls `model.predict(obs_vec, action_masks=mask, deterministic=True)`
- 5. Returns an action int in [0, 5]
-
- **Fallback**: If no model is found, return a random valid action.

 ---

 ## 7. Next Steps

- ### 7.1 Phase 2: Exploration Shaping (Pending)
-
- - Load `phase1_final.zip`
- - Add `RewardShapingWrapper` with an adaptive visit-count bonus
- - Train 500k steps vs random
- - Expected: Higher map coverage, less base-camping, similar win rate
-
- ### 7.2 Phase 3: Curriculum Self-Play (Pending)
-
- - Load the Phase 2 final model
- - Configure the 3-team environment
- - Progress through rule-based opponent difficulty
- - Expected: Win rate drops initially, then recovers as the curriculum advances
-
- ### 7.3 Evaluation Against Non-Random Opponents (Pending)
-
- - Evaluate the Phase 3 model vs rule-based "smart" opponents
- - Target: > 50% win rate against smart opponents
- - Multi-team evaluation (3-way matches)
-
- ### 7.4 Known Limitations
-
- - **No recurrent policy**: The MLP policy has no memory of past observations. It may struggle with bomb fuse timing or opponent tracking.
- - **No opponent modeling**: The policy treats opponent actions as environment noise. It could benefit from opponent ID or history encoding.
- - **Flattened observations**: Dict observations with spatial structure (viewcones) are flattened into vectors. A CNN policy might exploit spatial patterns better.
- - **Deterministic eval**: Currently uses `deterministic=True` for evaluation. Stochastic evaluation might reveal policy variance.
-
- ### 7.5 Future Improvements
-
- 1. **CNN policy**: Use `CnnPolicy` with 2D viewcone inputs instead of the MLP
- 2. **LSTM/GRU**: Add memory for temporal opponent tracking
- 3. **Proper self-play**: Train both teams simultaneously with shared or separated policy networks
- 4. **Population-based training**: Train a population of agents and evaluate them against each other
- 5. **Reward decomposition**: Separate rewards for movement, bombs, kills, survival, resource collection
-
- ---
-
- *Last updated: 2026-05-14*
- *Current phase: Phase 1 complete, Phase 2 pending*
- *Author: ML Intern (E-Rong)*
 
 ### 1.1 Domain: Multi-Agent Bomberman RL

+ The TIL-26-AE challenge is a multi-agent Bomberman-like environment where agents navigate a grid, collect resources, place bombs, destroy walls, and eliminate opponents. The key challenge is **autonomous exploration**.

+ ### 1.2 Key Papers

 | Paper | arXiv ID | Key Insight | Relevance |
 |---|---|---|---|
+ | *Pommerman: A Multi-Agent Benchmark* | 2407.00662 | PettingZoo + parallel env standard | Confirmed approach |
+ | *MAPPO* | 2103.01955 | Shared parameters, curriculum | Justified curriculum |
+ | *Invalid Action Masking* | 2006.14171 | Masks logits before softmax | **Directly applicable** |
+ | *PPO Algorithms* | 1707.06347 | Clipped surrogate, stable | Chosen over DQN |

 ### 1.3 Why MaskablePPO?

+ Bomberman agents cannot move into walls, go out of bounds, or place bombs without a stockpile. The observation includes `action_mask: uint8[6]`. Standard PPO would waste ~30-40% of samples on illegal moves; MaskablePPO masks logits before softmax, ensuring only legal actions are sampled.

 ### 1.4 Why Curriculum Learning?

+ Training against strong opponents from scratch leads to catastrophic early losses (~0 reward). Curriculum learning (easy → hard) is standard in competitive multi-agent RL.

+ ### 1.5 Why Not DQN?

+ DQN struggles with action masking (it requires a custom architecture). PPO's on-policy updates handle the non-stationarity of multi-agent self-play better, and `sb3-contrib` has mature masking support.

 ---

 ### 2.1 Environment Structure

+ - **Grid size**: 16×16
 - **Agents**: Configurable (default 2 teams, Phase 3 uses 3)
+ - **Observations**: Dict with `agent_viewcone[7×5×25]`, `base_viewcone[5×5×25]`, direction, location, health, `action_mask[6]`, etc.
+ - **Actions**: Discrete(6) — FORWARD, BACKWARD, LEFT, RIGHT, STAY, PLACE_BOMB
+ - **Episode length**: ~200 steps

 ### 2.2 Observation Flattening

+ Flattened to a **1511-dim vector**: agent_viewcone (875) + base_viewcone (625) + 11 scalars.
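The flattening above can be sketched as follows; the field shapes come from section 2.1, but the helper name `flatten_obs` and the zero-filled dummy observation are ours, not the project's code.

```python
import numpy as np

# Hypothetical sketch of the dict -> 1511-dim flattening. Field names and
# shapes follow the observation spec; `flatten_obs` is our name.
SCALAR_KEYS = ["direction", "location", "base_location", "health",
               "frozen_ticks", "base_health", "team_resources",
               "team_bombs", "step"]  # 1+2+2+1+1+1+1+1+1 = 11 values

def flatten_obs(obs: dict) -> np.ndarray:
    parts = [np.asarray(obs["agent_viewcone"], np.float32).ravel(),  # 875
             np.asarray(obs["base_viewcone"], np.float32).ravel()]   # 625
    parts += [np.asarray(obs[k], np.float32).ravel() for k in SCALAR_KEYS]
    return np.concatenate(parts)  # 875 + 625 + 11 = 1511

dummy = {"agent_viewcone": np.zeros((7, 5, 25)),
         "base_viewcone": np.zeros((5, 5, 25)),
         "direction": 0, "location": (0, 0), "base_location": (0, 0),
         "health": 1, "frozen_ticks": 0, "base_health": 1,
         "team_resources": 0, "team_bombs": 0, "step": 0}
assert flatten_obs(dummy).shape == (1511,)
```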

+ ### 2.3 Action Masking

+ Critical bug found: `Monitor` must wrap *outside* `ActionMasker`, not inside. Otherwise `get_action_masks()` fails because `Monitor` does not expose `action_masks()`.

 ---

 ## 3. Development Decisions

+ ### 3.1 Single-Agent Wrapper

+ We control only `agent_0`; opponents use random (Phase 1-2) or rule-based (Phase 3) policies. This reduces the problem to single-agent RL in a non-stationary environment.
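A dependency-free sketch of this reduction: step every agent, surface only agent_0's transition. The real wrapper adapts the PettingZoo parallel env to `gymnasium.Env`; the toy env, its reward, and all names below are ours for illustration.

```python
import random

class ToyParallelEnv:
    """Stand-in for the parallel Bomberman env: 2 agents, 6 actions."""
    agents = ["agent_0", "agent_1"]

    def reset(self):
        return {a: {"step": 0} for a in self.agents}

    def step(self, actions):
        obs = {a: {"step": 1} for a in self.agents}
        rew = {a: float(actions[a] == 4) for a in self.agents}  # toy: reward STAY
        done = {a: False for a in self.agents}
        return obs, rew, done

class SingleAgentWrapper:
    """Run the full multi-agent step, return only agent_0's view."""
    def __init__(self, env, opponent=None):
        self.env = env
        self.opponent = opponent or (lambda obs: random.randrange(6))
        self._last = None

    def reset(self):
        self._last = self.env.reset()
        return self._last["agent_0"]

    def step(self, action):
        acts = {"agent_0": action}
        for a in self.env.agents:          # opponents act via their policy
            if a != "agent_0":
                acts[a] = self.opponent(self._last[a])
        obs, rew, done = self.env.step(acts)
        self._last = obs
        return obs["agent_0"], rew["agent_0"], done["agent_0"]

env = SingleAgentWrapper(ToyParallelEnv())
env.reset()
obs, r, d = env.step(4)
assert r == 1.0 and d is False
```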
 
 
+ ### 3.2 3-Phase Curriculum

 | Phase | Opponent | Duration | Purpose |
 |---|---|---|---|
+ | **1** | Random | 500k | Learn movement, bombs, basics |
+ | **2** | Random + exploration bonus | 500k | Prevent camping exploit |
+ | **3** | Rule-based curriculum | 1M | Generalize to structured opponents |

+ ### 3.3 Library Stack

+ - `stable-baselines3` for the PPO core
+ - `sb3-contrib` for MaskablePPO + ActionMasker
+ - `huggingface_hub` for persistent checkpoint storage

+ ### 3.4 Why Push to the Hub Every 50k Steps?

+ Sandbox resets (T4 container recycling) caused local `/app/data/` loss multiple times. Hub checkpointing saved the project when training crashed at 400k steps.
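The dual-save cadence can be sketched with a pure helper that decides when a checkpoint should also be pushed to the Hub. The real version wires this into an SB3 callback calling `HfApi.upload_file()`; `should_push` and the values below are ours.

```python
def should_push(num_timesteps: int, last_push: int, every: int = 50_000) -> bool:
    """Push whenever another `every` steps have elapsed since the last push."""
    return num_timesteps - last_push >= every

# With 2048-step rollouts, pushes land on the first rollout boundary past
# each 50k mark.
pushed_at, uploads = 0, []
for step in range(0, 200_001, 2_048):
    if should_push(step, pushed_at):
        uploads.append(step)
        pushed_at = step
assert uploads == [51_200, 102_400, 153_600]
```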

+ ---

+ ## 4. Training Phases

+ ### 4.1 Phase 1: Foundation (vs Random)

+ **Duration**: 500,352 steps
+ **Result**: 92% win rate, 180.1 avg reward, 100% survival
+ **Challenges**: Wrapper ordering, missing dependencies, sandbox resets

+ ### 4.2 Phase 2: Exploration Shaping (IN PROGRESS)

+ **Status**: Started at 500,352 steps, running on A10G at ~54 FPS
+ **Mechanism**: Visit-count bonus `1/(1+visits)`, annealed by `1 - tanh(k * avg_enemy_deaths)`
+ **ETA**: ~2.5 hours, targeting 1,000,352 total steps
+ **Purpose**: Force map exploration, prevent safe base-camping
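The shaping rule above can be written as a standalone function. The formula follows the Phase 2 mechanism and `k=1.2` matches the interim table; `base_weight`, the 0.1 floor, and the helper name are illustrative.

```python
import math

def exploration_bonus(visit_counts, x, y, avg_enemy_deaths,
                      base_weight=1.0, k=1.2):
    """Visit-count bonus, annealed away as the agent learns to kill."""
    visit_bonus = 1.0 / (1.0 + visit_counts[(x, y)])
    alpha = 1.0 - math.tanh(k * avg_enemy_deaths)   # fades as kills rise
    explore_weight = base_weight * max(0.1, alpha)  # floor keeps some bonus
    return explore_weight * visit_bonus

visits = {(3, 4): 0}
early = exploration_bonus(visits, 3, 4, avg_enemy_deaths=0.0)  # full bonus
late = exploration_bonus(visits, 3, 4, avg_enemy_deaths=5.0)   # annealed
assert late < early
```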

+ ### 4.3 Phase 3: Curriculum Self-Play

+ **Pending**: Rule-based static → simple → smart → mixed, 3 teams, 1M steps
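The stage-advancement gate ("win rate ≥ 55% over 500 episodes, or max 500 episodes per stage") can be sketched as below. The rolling window of 100 episodes for the early-advance check is our guess; the source only states the threshold and the cap.

```python
from collections import deque

class CurriculumGate:
    """Advance a stage when the rolling win rate clears the threshold,
    or unconditionally once the per-stage episode cap is reached."""
    def __init__(self, threshold=0.55, window=100, cap=500):
        self.threshold, self.cap = threshold, cap
        self.recent = deque(maxlen=window)
        self.episodes = 0

    def record(self, won: bool) -> bool:
        """Record one episode result; True means move to the next stage."""
        self.episodes += 1
        self.recent.append(won)
        if self.episodes >= self.cap:
            return True                      # hard cap per stage
        full = len(self.recent) == self.recent.maxlen
        return full and sum(self.recent) / len(self.recent) >= self.threshold

gate = CurriculumGate()
assert any(gate.record(True) for _ in range(100))  # straight wins advance early
```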

 ---

 | Metric | Value |
 |---|---|
+ | Timesteps | 500,352 |
+ | Final Reward | 237.0 |
+ | FPS | 52 (A10G) |
+ | Wall time | ~2h 15min |
+ | Win Rate (eval) | **92.0%** |
+ | Avg Reward (eval) | **180.1** |
+ | Survival Rate | **100.0%** |

+ ### 5.2 Phase 2 Interim (Early)

 | Metric | Value |
 |---|---|
+ | Starting Step | 500,352 |
+ | Initial Reward (shaped) | 210 |
+ | FPS | 54 |
+ | Explore Weight | Adaptive, k=1.2 |

 ---

 ## 6. Artifacts

 | File | Purpose |
 |---|---|
+ | `phase1_final.zip` | Trained model |
+ | `phase2_final.zip` | *(in progress)* |
+ | `ckpt_50000-400000.zip` | Phase 1 intermediates |
+ | `ae_manager.py` | Inference code |
+ | `docs/ae.md` | This documentation |

 ---

 ## 7. Next Steps

+ - **Phase 2**: Complete 500k exploration-shaping steps
+ - **Phase 3**: Curriculum vs rule-based opponents (1M steps)
+ - **Eval**: Multi-team evaluation vs smart opponents
+ - **Future**: CNN policy, opponent modeling, LSTM memory

+ *Last updated: 2026-05-14 — Phase 2 in progress*