E-Rong committed
Commit 47c41e4 · verified · 1 Parent(s): 9da3da4

Update docs: Phase 2 complete, Phase 3 ready

Files changed (1)
  1. docs/ae.md +27 -19
docs/ae.md CHANGED
@@ -102,16 +102,21 @@ Sandbox resets (T4 container recycling) caused local `/app/data/` loss multiple
 **Result**: Win rate 92%, avg reward 180.1, 100% survival
 **Challenges**: Wrapper ordering, dependency issues, sandbox resets
 
-### 4.2 Phase 2: Exploration Shaping (IN PROGRESS)
+### 4.2 Phase 2: Exploration Shaping (COMPLETE)
 
-**Status**: Started at 500352 steps, running on A10G at ~54 FPS
+**Duration**: 500,408 additional steps (600,352 → 1,001,760)
 **Mechanism**: Visit-count bonus = 1/(1+visits), adaptive annealing via tanh(avg_enemy_deaths)
-**ETA**: ~2.5 hours, targets 1,000,352 total steps
-**Purpose**: Force map exploration, prevent safe base-camping
+**Hardware**: A10G, ~50 FPS
+**Wall time**: ~2h 45min
+**Result**: Win rate 93.0%, avg reward 153.4, avg bombs 20.1
+**Key insight**: Reward decreased (180→153) but win rate increased (92%→93%), confirming exploration makes the policy more robust at the cost of safe base-camping reward.
 
-### 4.3 Phase 3: Curriculum Self-Play
+### 4.3 Phase 3: Curriculum Self-Play (PENDING)
 
-**Pending**: Rule-based static → simple → smart → mixed, 3 teams, 1M steps
+**Script**: `phase3_curriculum.py` (ready on Hub)
+**Plan**: 5-stage rule-based curriculum — static → random → simple_bomb → evasive → mixed
+**Duration**: 1M steps
+**Advancement gate**: >55% win rate per stage
 
 ---
@@ -129,14 +134,16 @@ Sandbox resets (T4 container recycling) caused local `/app/data/` loss multiple
 | Avg Reward (eval) | **180.1** |
 | Survival Rate | **100.0%** |
 
-### 5.2 Phase 2 Interim (Early)
+### 5.2 Phase 2 Results
 
 | Metric | Value |
 |---|---|
-| Starting Step | 500,352 |
-| Initial Reward (shaped) | 210 |
-| FPS | 54 |
-| Explore Weight | Adaptive k=1.2 |
+| Timesteps | 1,001,760 total (500,408 new) |
+| FPS | 50 (A10G) |
+| Wall time | ~2h 45min |
+| Win Rate (eval) | **93.0%** |
+| Avg Reward (eval) | **153.4** |
+| Avg Bombs | **20.1** |
 
 ---
@@ -144,9 +151,10 @@ Sandbox resets (T4 container recycling) caused local `/app/data/` loss multiple
 
 | File | Purpose |
 |---|---|
-| `phase1_final.zip` | Trained model |
-| `phase2_final.zip` | *(in progress)* |
-| `ckpt_50000-400000.zip` | Phase 1 intermediates |
+| `phase1_final.zip` | Phase 1 complete checkpoint |
+| `phase2_final.zip` | Phase 2 complete checkpoint |
+| `phase2_ckpt_*.zip` | Phase 2 intermediates (650k–1M) |
+| `phase2_eval_results.txt` | Phase 2 evaluation metrics |
 | `ae_manager.py` | Inference code |
 | `docs/ae.md` | This documentation |
 
@@ -154,9 +162,9 @@ Sandbox resets (T4 container recycling) caused local `/app/data/` loss multiple
 
 ## 7. Next Steps
 
-- **Phase 2**: Complete 500k exploration-shaping steps
-- **Phase 3**: Curriculum vs rule-based opponents (1M steps)
-- **Eval**: Multi-team evaluation vs smart opponents
-- **Future**: CNN policy, opponent modeling, LSTM memory
+- [ ] Submit Phase 3 HF Job (`phase3_curriculum.py`)
+- [ ] Monitor 5-stage curriculum progression
+- [ ] Evaluate final model vs mixed rule-based opponents
+- [ ] Future: CNN policy, opponent modeling, LSTM memory
 
-*Last updated: 2026-05-14 — Phase 2 in progress*
+*Last updated: 2026-05-14 — Phase 2 complete, Phase 3 ready*
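
The Phase 2 shaping mechanism described above (visit-count bonus `1/(1+visits)`, annealed via `tanh(avg_enemy_deaths)`) can be sketched as follows. This is a minimal illustration, not the repo's actual code: the cell-keyed visit table, the `ExplorationBonus` class name, and wiring `k=1.2` as the explore weight are assumptions based on the doc's "Explore Weight | Adaptive k=1.2" row.

```python
import math
from collections import defaultdict

class ExplorationBonus:
    """Sketch of a visit-count exploration bonus with adaptive annealing.

    bonus(cell) = k * (1 - anneal) / (1 + visits[cell]),
    where anneal = tanh(avg_enemy_deaths): once the agent reliably
    scores kills, the exploration incentive fades toward zero.
    """

    def __init__(self, k: float = 1.2):
        self.k = k                      # explore weight (assumed from "Adaptive k=1.2")
        self.visits = defaultdict(int)  # visit count per discretized map cell

    def bonus(self, cell: tuple, avg_enemy_deaths: float) -> float:
        self.visits[cell] += 1
        anneal = math.tanh(avg_enemy_deaths)  # in [0, 1) for non-negative input
        return self.k * (1.0 - anneal) / (1.0 + self.visits[cell])

shaper = ExplorationBonus()
early = shaper.bonus((3, 7), avg_enemy_deaths=0.0)   # fresh cell, no annealing
revisit = shaper.bonus((3, 7), avg_enemy_deaths=0.0)  # same cell pays less
```

The `1/(1+visits)` term discourages base-camping (staying put keeps re-paying an ever-shrinking bonus), which matches the commit's observation that shaped reward drops while win rate rises.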
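
The Phase 3 plan (5-stage curriculum, >55% win-rate gate, 1M-step budget) implies a driver loop along these lines. This is a hedged sketch, not `phase3_curriculum.py` itself: `train_steps`, `evaluate_win_rate`, and the 50k-step chunk size are hypothetical placeholders for the training harness.

```python
STAGES = ["static", "random", "simple_bomb", "evasive", "mixed"]
WIN_RATE_GATE = 0.55      # advance only after >55% eval win rate (from the doc)
TOTAL_STEPS = 1_000_000   # overall Phase 3 budget (from the doc)

def run_curriculum(train_steps, evaluate_win_rate, chunk=50_000):
    """Advance through opponent stages, gating on evaluated win rate.

    train_steps(stage, n) and evaluate_win_rate(stage) are hypothetical
    callbacks; a real run would wrap the env with the stage's opponent
    policy and call the PPO learner for n timesteps.
    """
    steps_used, stage_idx = 0, 0
    while steps_used < TOTAL_STEPS and stage_idx < len(STAGES):
        stage = STAGES[stage_idx]
        train_steps(stage, chunk)
        steps_used += chunk
        if evaluate_win_rate(stage) > WIN_RATE_GATE:
            stage_idx += 1            # gate passed: move to harder opponents
    return steps_used, stage_idx
```

Note the loop spends the full step budget even if a stage never clears the gate, so a stalled curriculum fails loudly (final `stage_idx < 5`) rather than silently skipping ahead.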