=== Phase 1 Summary ===
Training: MaskablePPO vs Random Opponents
Timesteps: 500,352
Final Training Reward: 237.0

Evaluation (100 episodes vs Random):

=== TIL-26-AE Phase 1 Evaluation Results ===
Model: phase1_final.zip (500k steps)
Episodes: 100
Win Rate: 92.0% (92/100)
Avg Reward: 180.1
Avg Episode Length: 200.0
Avg Bombs/Episode: 20.4
Survival Rate (198+ steps): 100.0%

Checkpoints saved: ckpt_50000 to ckpt_400000 + phase1_final.zip
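
The evaluation figures above are plain aggregates over per-episode records. The following is a minimal sketch of how such a summary could be computed. It assumes each episode yields a win flag, a total reward, a length in steps, and a bomb count. `EpisodeResult` and `summarize` are hypothetical names, not taken from the project. The reading of "198+ steps" as `length >= 198` is also an assumption.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class EpisodeResult:
    """One evaluation episode (hypothetical record layout)."""
    won: bool
    reward: float
    length: int
    bombs: int


def summarize(results, survival_threshold=198):
    """Aggregate per-episode records into summary metrics.

    survival_threshold is an assumed interpretation of the
    "Survival Rate (198+ steps)" label: an episode counts as
    survived when it lasts at least that many steps.
    """
    n = len(results)
    if n == 0:
        raise ValueError("no episodes to summarize")
    wins = sum(r.won for r in results)
    survived = sum(r.length >= survival_threshold for r in results)
    return {
        "episodes": n,
        "wins": wins,
        "win_rate_pct": 100.0 * wins / n,
        "avg_reward": mean(r.reward for r in results),
        "avg_length": mean(r.length for r in results),
        "avg_bombs": mean(r.bombs for r in results),
        "survival_rate_pct": 100.0 * survived / n,
    }


if __name__ == "__main__":
    # Toy records for illustration only, not the actual evaluation data.
    demo = [
        EpisodeResult(won=True, reward=200.0, length=200, bombs=20),
        EpisodeResult(won=False, reward=100.0, length=150, bombs=10),
    ]
    s = summarize(demo)
    print(f"Win Rate: {s['win_rate_pct']:.1f}% ({s['wins']}/{s['episodes']})")
    print(f"Avg Reward: {s['avg_reward']:.1f}")
```

With 92 wins in 100 episodes, `win_rate_pct` works out to 100.0 * 92 / 100 = 92.0, matching the reported win rate.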