E-Rong committed
Commit 47c41e4 · verified · 1 Parent(s): 9da3da4

Update docs: Phase 2 complete, Phase 3 ready

Files changed (1)
  1. docs/ae.md +27 -19
docs/ae.md CHANGED
@@ -102,16 +102,21 @@ Sandbox resets (T4 container recycling) caused local `/app/data/` loss multiple
 **Result**: Win rate 92%, avg reward 180.1, 100% survival
 **Challenges**: Wrapper ordering, dependency issues, sandbox resets
 
-### 4.2 Phase 2: Exploration Shaping (IN PROGRESS)
+### 4.2 Phase 2: Exploration Shaping (COMPLETE)
 
-**Status**: Started at 500352 steps, running on A10G at ~54 FPS
+**Duration**: 500,408 additional steps (600,352 → 1,001,760)
 **Mechanism**: Visit-count bonus = 1/(1+visits), adaptive annealing via tanh(avg_enemy_deaths)
-**ETA**: ~2.5 hours, targets 1,000,352 total steps
-**Purpose**: Force map exploration, prevent safe base-camping
+**Hardware**: A10G, ~50 FPS
+**Wall time**: ~2h 45min
+**Result**: Win rate 93.0%, avg reward 153.4, avg bombs 20.1
+**Key insight**: Reward decreased (180→153) but win rate increased (92%→93%), confirming exploration makes the policy more robust at the cost of safe base-camping reward.
 
-### 4.3 Phase 3: Curriculum Self-Play
+### 4.3 Phase 3: Curriculum Self-Play (PENDING)
 
-**Pending**: Rule-based static → simple → smart → mixed, 3 teams, 1M steps
+**Script**: `phase3_curriculum.py` (ready on Hub)
+**Plan**: 5-stage rule-based curriculum — static → random → simple_bomb → evasive → mixed
+**Duration**: 1M steps
+**Advancement gate**: >55% win rate per stage
 
 ---
@@ -129,14 +134,16 @@ Sandbox resets (T4 container recycling) caused local `/app/data/` loss multiple
 | Avg Reward (eval) | **180.1** |
 | Survival Rate | **100.0%** |
 
-### 5.2 Phase 2 Interim (Early)
+### 5.2 Phase 2 Results
 
 | Metric | Value |
 |---|---|
-| Starting Step | 500,352 |
-| Initial Reward (shaped) | 210 |
-| FPS | 54 |
-| Explore Weight | Adaptive k=1.2 |
+| Timesteps | 1,001,760 total (500,408 new) |
+| FPS | 50 (A10G) |
+| Wall time | ~2h 45min |
+| Win Rate (eval) | **93.0%** |
+| Avg Reward (eval) | **153.4** |
+| Avg Bombs | **20.1** |
 
 ---
@@ -144,9 +151,10 @@ Sandbox resets (T4 container recycling) caused local `/app/data/` loss multiple
 
 | File | Purpose |
 |---|---|
-| `phase1_final.zip` | Trained model |
-| `phase2_final.zip` | *(in progress)* |
-| `ckpt_50000-400000.zip` | Phase 1 intermediates |
+| `phase1_final.zip` | Phase 1 complete checkpoint |
+| `phase2_final.zip` | Phase 2 complete checkpoint |
+| `phase2_ckpt_*.zip` | Phase 2 intermediates (650k–1M) |
+| `phase2_eval_results.txt` | Phase 2 evaluation metrics |
 | `ae_manager.py` | Inference code |
 | `docs/ae.md` | This documentation |
 
@@ -154,9 +162,9 @@ Sandbox resets (T4 container recycling) caused local `/app/data/` loss multiple
 
 ## 7. Next Steps
 
-- **Phase 2**: Complete 500k exploration-shaping steps
-- **Phase 3**: Curriculum vs rule-based opponents (1M steps)
-- **Eval**: Multi-team evaluation vs smart opponents
-- **Future**: CNN policy, opponent modeling, LSTM memory
+- [ ] Submit Phase 3 HF Job (`phase3_curriculum.py`)
+- [ ] Monitor 5-stage curriculum progression
+- [ ] Evaluate final model vs mixed rule-based opponents
+- [ ] Future: CNN policy, opponent modeling, LSTM memory
 
-*Last updated: 2026-05-14 — Phase 2 in progress*
+*Last updated: 2026-05-14 — Phase 2 complete, Phase 3 ready*
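
The Phase 2 shaping mechanism described above (visit-count bonus `1/(1+visits)`, annealed via `tanh(avg_enemy_deaths)`) can be sketched as follows. This is a minimal illustration, not the repo's actual code: the cell-keyed visit table, the `ExplorationBonus` class name, and wiring `k=1.2` as the explore weight are assumptions based on the doc's "Explore Weight | Adaptive k=1.2" row.

```python
import math
from collections import defaultdict

class ExplorationBonus:
    """Sketch of a visit-count exploration bonus with adaptive annealing.

    bonus(cell) = k * (1 - anneal) / (1 + visits[cell]),
    where anneal = tanh(avg_enemy_deaths): once the agent reliably
    scores kills, the exploration incentive fades toward zero.
    """

    def __init__(self, k: float = 1.2):
        self.k = k                      # explore weight (assumed from "Adaptive k=1.2")
        self.visits = defaultdict(int)  # visit count per discretized map cell

    def bonus(self, cell: tuple, avg_enemy_deaths: float) -> float:
        self.visits[cell] += 1
        anneal = math.tanh(avg_enemy_deaths)  # in [0, 1) for non-negative input
        return self.k * (1.0 - anneal) / (1.0 + self.visits[cell])

shaper = ExplorationBonus()
early = shaper.bonus((3, 7), avg_enemy_deaths=0.0)   # fresh cell, no annealing
revisit = shaper.bonus((3, 7), avg_enemy_deaths=0.0)  # same cell pays less
```

The `1/(1+visits)` term discourages base-camping (staying put keeps re-paying an ever-shrinking bonus), which matches the commit's observation that shaped reward drops while win rate rises.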
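
The Phase 3 plan (5-stage curriculum, >55% win-rate gate, 1M-step budget) implies a driver loop along these lines. This is a hedged sketch, not `phase3_curriculum.py` itself: `train_steps`, `evaluate_win_rate`, and the 50k-step chunk size are hypothetical placeholders for the training harness.

```python
STAGES = ["static", "random", "simple_bomb", "evasive", "mixed"]
WIN_RATE_GATE = 0.55      # advance only after >55% eval win rate (from the doc)
TOTAL_STEPS = 1_000_000   # overall Phase 3 budget (from the doc)

def run_curriculum(train_steps, evaluate_win_rate, chunk=50_000):
    """Advance through opponent stages, gating on evaluated win rate.

    train_steps(stage, n) and evaluate_win_rate(stage) are hypothetical
    callbacks; a real run would wrap the env with the stage's opponent
    policy and call the PPO learner for n timesteps.
    """
    steps_used, stage_idx = 0, 0
    while steps_used < TOTAL_STEPS and stage_idx < len(STAGES):
        stage = STAGES[stage_idx]
        train_steps(stage, chunk)
        steps_used += chunk
        if evaluate_win_rate(stage) > WIN_RATE_GATE:
            stage_idx += 1            # gate passed: move to harder opponents
    return steps_used, stage_idx
```

Note the loop spends the full step budget even if a stage never clears the gate, so a stalled curriculum fails loudly (final `stage_idx < 5`) rather than silently skipping ahead.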