Update docs: Phase 2 complete, Phase 3 ready
docs/ae.md CHANGED +27 -19
@@ -102,16 +102,21 @@ Sandbox resets (T4 container recycling) caused local `/app/data/` loss multiple
  **Result**: Win rate 92%, avg reward 180.1, 100% survival
  **Challenges**: Wrapper ordering, dependency issues, sandbox resets

- ### 4.2 Phase 2: Exploration Shaping (

- **
  **Mechanism**: Visit-count bonus = 1/(1+visits), adaptive annealing via tanh(avg_enemy_deaths)
- **
- **

- ### 4.3 Phase 3: Curriculum Self-Play

- **

  ---

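The **Mechanism** line above can be illustrated with a minimal Python sketch. It is not taken from the training script: the class and function names are hypothetical, the counter is keyed by grid cell as an assumption, and the annealing factor `1 - tanh(avg_enemy_deaths)` is only one plausible reading of "adaptive annealing via tanh(avg_enemy_deaths)".

```python
import math
from collections import defaultdict


class VisitCountBonus:
    """Count-based exploration bonus: 1 / (1 + visits) per cell (hypothetical helper)."""

    def __init__(self):
        # Maps a cell (any hashable key) to the number of earlier visits.
        self.visits = defaultdict(int)

    def bonus(self, cell):
        # ASSUMPTION: the count used is the one *before* this visit,
        # so a never-seen cell pays the full bonus of 1.0.
        value = 1.0 / (1.0 + self.visits[cell])
        self.visits[cell] += 1
        return value


def anneal_scale(avg_enemy_deaths):
    # ASSUMPTION: the bonus is multiplied by 1 - tanh(avg_enemy_deaths),
    # so it fades as the agent eliminates more enemies per episode.
    return 1.0 - math.tanh(avg_enemy_deaths)


def shaped_reward(env_reward, cell, counter, avg_enemy_deaths):
    # Environment reward plus the annealed exploration bonus.
    return env_reward + anneal_scale(avg_enemy_deaths) * counter.bonus(cell)
```

Under these assumptions a first visit to a cell adds 1.0 to the reward, a second visit adds 0.5, and the bonus shrinks towards zero as `avg_enemy_deaths` grows.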
@@ -129,14 +134,16 @@ Sandbox resets (T4 container recycling) caused local `/app/data/` loss multiple
  | Avg Reward (eval) | **180.1** |
  | Survival Rate | **100.0%** |

- ### 5.2 Phase 2

  | Metric | Value |
  |---|---|
- |
- |
- |
- |

  ---

@@ -144,9 +151,10 @@ Sandbox resets (T4 container recycling) caused local `/app/data/` loss multiple

  | File | Purpose |
  |---|---|
- | `phase1_final.zip` |
- | `phase2_final.zip` |
- | `
  | `ae_manager.py` | Inference code |
  | `docs/ae.md` | This documentation |

@@ -154,9 +162,9 @@ Sandbox resets (T4 container recycling) caused local `/app/data/` loss multiple

  ## 7. Next Steps

- -
- -
- -
- -

- *Last updated: 2026-05-14 - Phase 2

  **Result**: Win rate 92%, avg reward 180.1, 100% survival
  **Challenges**: Wrapper ordering, dependency issues, sandbox resets

+ ### 4.2 Phase 2: Exploration Shaping (COMPLETE)

+ **Duration**: 500,408 additional steps (600,352 → 1,001,760)
  **Mechanism**: Visit-count bonus = 1/(1+visits), adaptive annealing via tanh(avg_enemy_deaths)
+ **Hardware**: A10G, ~50 FPS
+ **Wall time**: ~2h 45min
+ **Result**: Win rate 93.0%, avg reward 153.4, avg bombs 20.1
+ **Key insight**: Reward decreased (180→153) but win rate increased (92%→93%), confirming exploration makes the policy more robust at the cost of safe base-camping reward.

+ ### 4.3 Phase 3: Curriculum Self-Play (PENDING)

+ **Script**: `phase3_curriculum.py` (ready on Hub)
+ **Plan**: 5-stage rule-based curriculum: static → random → simple_bomb → evasive → mixed
+ **Duration**: 1M steps
+ **Advancement gate**: >55% win rate per stage

  ---

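The Phase 3 plan (five opponent stages, advancing once the win rate exceeds 55%) could be tracked roughly as follows. This is a sketch and not the contents of `phase3_curriculum.py`: the `CurriculumTracker` class, the rolling window of 100 episodes, and resetting the window on advancement are all assumptions made for illustration.

```python
from collections import deque

# Stage names and gate come from the plan; everything else is assumed.
STAGES = ["static", "random", "simple_bomb", "evasive", "mixed"]
WIN_RATE_GATE = 0.55


class CurriculumTracker:
    """Advance to the next opponent stage once the rolling win rate exceeds the gate."""

    def __init__(self, window=100):
        self.stage_index = 0
        self.window = window  # hypothetical number of episodes per estimate
        self.results = deque(maxlen=window)

    @property
    def stage(self):
        return STAGES[self.stage_index]

    def record(self, won):
        """Record one episode outcome and return the stage for the next episode."""
        self.results.append(bool(won))
        if len(self.results) < self.window:
            return self.stage  # not enough episodes to judge yet
        win_rate = sum(self.results) / len(self.results)
        if win_rate > WIN_RATE_GATE and self.stage_index < len(STAGES) - 1:
            self.stage_index += 1
            self.results.clear()  # start a fresh estimate for the new stage
        return self.stage
```

A training loop would call `tracker.record(won)` after each episode and build the next opponent from `tracker.stage`. Once the last stage (`mixed`) is reached, the tracker stays there.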
  | Avg Reward (eval) | **180.1** |
  | Survival Rate | **100.0%** |

+ ### 5.2 Phase 2 Results

  | Metric | Value |
  |---|---|
+ | Timesteps | 1,001,760 total (500,408 new) |
+ | FPS | 50 (A10G) |
+ | Wall time | ~2h 45min |
+ | Win Rate (eval) | **93.0%** |
+ | Avg Reward (eval) | **153.4** |
+ | Avg Bombs | **20.1** |

  ---

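As a quick sanity check on the table above, the reported wall time can be compared with the step count divided by throughput. The sketch below assumes that "FPS" means environment steps per second; the helper name `wall_time` is hypothetical.

```python
def wall_time(steps, fps):
    """Convert a step count and a steps-per-second rate into (hours, minutes, seconds)."""
    seconds = round(steps / fps)
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return hours, minutes, secs


print(wall_time(500_408, 50))  # (2, 46, 48)
```

At 50 steps per second, 500,408 new steps take about 2h 47min, which is in line with the reported ~2h 45min.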

  | File | Purpose |
  |---|---|
+ | `phase1_final.zip` | Phase 1 complete checkpoint |
+ | `phase2_final.zip` | Phase 2 complete checkpoint |
+ | `phase2_ckpt_*.zip` | Phase 2 intermediates (650k–1M) |
+ | `phase2_eval_results.txt` | Phase 2 evaluation metrics |
  | `ae_manager.py` | Inference code |
  | `docs/ae.md` | This documentation |


  ## 7. Next Steps

+ - [ ] Submit Phase 3 HF Job (`phase3_curriculum.py`)
+ - [ ] Monitor 5-stage curriculum progression
+ - [ ] Evaluate final model vs mixed rule-based opponents
+ - [ ] Future: CNN policy, opponent modeling, LSTM memory

+ *Last updated: 2026-05-14 - Phase 2 complete, Phase 3 ready*