# AGENTS.md: Context & Lessons for Future Sessions

> This file exists because sandboxes reset and I (the agent) lose all memory. 
> **READ THIS FIRST** before doing anything on this project.

---

## What This Project Is

- **Challenge**: TIL-26-AE (The Intelligent League: Automated Exploration)
- **Game**: Multi-agent Bomberman on a 16×16 grid
- **My Role**: Train `agent_0` via RL to compete autonomously
- **Main Repo**: `E-Rong/til-26-ae-agent` (models, checkpoints, scripts, docs)
- **Space**: `e-rong/til-26-ae` (evaluation server with `ae/src/ae_manager.py`)
- **TIL Source**: Private Space `e-rong/til-26-ae`, which contains the `til_environment/` module

---

## CRITICAL: What Killed Training & Cost Money

### ❌ NEVER USE SANDBOXES FOR TRAINING > 30 MINUTES

Sandboxes are **interactive dev environments**. They:
- Recycle after inactivity / timeout
- Kill processes silently
- **Keep billing you while empty** after the process dies

**Damage done**: ~$4.87 wasted across 4 sandbox sessions where training died but billing continued.

### ✅ ALWAYS USE HF JOBS FOR BATCH TRAINING

- Persistent GPU allocation
- Runs until completion (or your timeout)
- Fails visibly if something breaks (no silent empty billing)
- Must set `namespace="E-Rong"` to bill the org, not the user

### ❌ NEVER `git clone` A PRIVATE REPO IN AN HF JOB

`git clone https://huggingface.co/spaces/...` fails because git does not read `HF_TOKEN`.

**Use instead**:
```python
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id='e-rong/til-26-ae',
    repo_type='space',
    local_dir='/app/til-26-ae-repo'
)
```
`snapshot_download` auto-uses the `HF_TOKEN` env var.

### ✅ ALWAYS SMOKE-TEST A JOB BEFORE THE FULL RUN

Submit a 5-minute job that:
1. Downloads the TIL repo
2. Installs deps
3. Runs 100 training steps
4. Saves a dummy checkpoint to the Hub

Only after this succeeds, submit the multi-hour job.

---

## Session Startup Checklist

Before doing **anything** on this project:

1. [ ] Read `session_state.json` from `E-Rong/til-26-ae-agent`
2. [ ] Read this file (`AGENTS.md`)
3. [ ] Check latest checkpoint on Hub (sort `phase*_ckpt_*.zip` files)
4. [ ] Determine current phase and remaining steps
5. [ ] If training needed: write script to sandbox, **smoke-test in HF Job first**
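Step 3 of the checklist can be sketched as a small helper. This is a minimal sketch, assuming checkpoints follow the `phase<N>_ckpt_<steps>.zip` pattern seen in `phase2_ckpt_600352.zip`; the helper name `latest_checkpoint` and the `phase2_ckpt_550352.zip` entry in the sample list are hypothetical, for illustration only. On the Hub, the file list could come from something like `HfApi().list_repo_files(...)`.

```python
import re
from typing import Iterable, Optional, Tuple

# Assumed naming scheme, inferred from `phase2_ckpt_600352.zip`.
CKPT_PATTERN = re.compile(r"^phase(\d+)_ckpt_(\d+)\.zip$")


def latest_checkpoint(filenames: Iterable[str]) -> Optional[Tuple[int, int, str]]:
    """Return (phase, step, filename) for the newest checkpoint, or None."""
    found = []
    for name in filenames:
        match = CKPT_PATTERN.match(name)
        if match:
            found.append((int(match.group(1)), int(match.group(2)), name))
    # Compare numerically: a plain string sort would rank "99999" above "600352".
    return max(found, default=None)


# Illustrative file list; only the last name appears in this document.
files = [
    "AGENTS.md",
    "phase1_final.zip",
    "phase2_ckpt_550352.zip",
    "phase2_ckpt_600352.zip",
]
newest = latest_checkpoint(files)
print(newest)
```

Sorting on the parsed integers rather than the raw filenames matters once step counts have different digit lengths.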

---

## Technical Decisions That Work

### MaskablePPO + Action Masking
- `sb3_contrib.MaskablePPO` with `ActionMasker`
- Bomberman has `action_mask: uint8[6]`: walls and edges make some moves illegal
- Standard PPO wastes ~30-40% of samples on illegal actions early on
- **Papers**: Huang & Ontañón, "A Closer Look at Invalid Action Masking in Policy Gradient Algorithms" (arxiv:2006.14171)

### Observation Flattening
1511-dim vector from dict observation:
```
agent_viewcone:  7×5×25 = 875
base_viewcone:   5×5×25 = 625
direction, location[2], base_location[2], health, frozen_ticks,
base_health, team_resources, team_bombs, step = 11 scalars
Total: 1511
```
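The arithmetic above can be checked mechanically. The following is a minimal sketch, not the project's flattening code: the shapes are copied from the layout, while the dictionary name `OBS_LAYOUT`, the helper names, and the field order are assumptions.

```python
from math import prod

# Shapes copied from the layout above. The key order is an assumption;
# the real flattening code may concatenate the fields differently.
OBS_LAYOUT = {
    "agent_viewcone": (7, 5, 25),
    "base_viewcone": (5, 5, 25),
    "direction": (1,),
    "location": (2,),
    "base_location": (2,),
    "health": (1,),
    "frozen_ticks": (1,),
    "base_health": (1,),
    "team_resources": (1,),
    "team_bombs": (1,),
    "step": (1,),
}


def flat_dim(layout):
    """Total length of the flattened observation vector."""
    return sum(prod(shape) for shape in layout.values())


def flat_offsets(layout):
    """Map each field to its (start, end) slice in the flat vector."""
    offsets, start = {}, 0
    for name, shape in layout.items():
        end = start + prod(shape)
        offsets[name] = (start, end)
        start = end
    return offsets


print(flat_dim(OBS_LAYOUT))
```

A check like this is cheap to run in a smoke test: if the environment's observation dict changes shape, the total stops matching the policy's input size.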

### Wrapper Order (CRITICAL)
```python
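from sb3_contrib.common.wrappers import ActionMasker
from stable_baselines3.common.monitor import Monitor

# CORRECT: wrap with ActionMasker first, then Monitor
env = ActionMasker(base_env, lambda e: e.action_masks())
env = Monitor(env)

# WRONG: Monitor blocks action_masks() exposure
# env = ActionMasker(Monitor(base_env), ...)  # DON'T DO THIS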
```

### 3-Phase Curriculum
| Phase | Opponent | Duration | Purpose |
|---|---|---|---|
| 1 | Random | 500k | Learn basics |
| 2 | Random + visit-count shaping | 500k | Prevent camping |
| 3 | Rule-based curriculum | 1M | Generalize to structured opponents |

### Checkpointing Every 50k Steps
- Local + Hub push via `HfApi.upload_file()`
- Saved the project when sandboxes reset at 400k and 600k steps

---

## Technical Decisions That Failed

| Decision | Why It Failed | Fix |
|---|---|---|
| Training in sandboxes | Process died, empty sandbox kept billing | Use HF Jobs |
| `git clone` in HF Job | No auth for private repo | `snapshot_download` |
| Inline 20KB script in `hf_jobs.script` | Delivery mechanism choked | Write to sandbox file first, submit path |
| No session state on Hub | Lost track of progress across resets | `session_state.json` + this file |
| `Monitor` inside `ActionMasker` | `get_action_masks()` failed | `ActionMasker` → `Monitor` order |

---

## Cost Awareness

| Hardware | $/hr | Good For |
|---|---|---|
| `cpu-basic` | ~$0.05 | Writing scripts, reading files, small tests |
| `t4-small` | ~$0.40 | Short dev, NOT training |
| `a10g-small` | ~$1.00 | Training, but use HF Jobs not sandboxes |
| `a10g-large` | ~$2.00 | Larger batch sizes, not needed for this project |

**Rule**: If a task takes >30 min, it must be an HF Job. Sandboxes are for editing and quick tests only.

---

## Sandbox Policy (User Mandate)

> **From this point forward, the user has mandated:**

1. **Start `cpu-basic` sandbox** at the beginning of every session
2. **Use `cpu-basic` for**: context, writing code, writing docs, editing files, planning
3. **Only switch to GPU sandbox** (`t4-small` or `a10g-small`) when performing **smoke tests** for training scripts
4. **Stop GPU sandbox IMMEDIATELY** after the smoke test completes
5. **Training tasks ONLY as HF Jobs**: never leave a training process running in a sandbox
6. **Never leave a GPU sandbox running idle**: this wastes money

**Why this matters**: A GPU sandbox at $1/hr running empty for 3 hours = $3 wasted for nothing. An HF Job at the same $1/hr actually trains for every billed minute.
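The same arithmetic works for any hardware tier. This is a rough sketch using the approximate rates from the Cost Awareness table; the names `HOURLY_RATE_USD` and `idle_cost` are hypothetical, and actual billing may differ from these estimates.

```python
# Approximate hourly rates copied from the Cost Awareness table (assumed, not exact).
HOURLY_RATE_USD = {
    "cpu-basic": 0.05,
    "t4-small": 0.40,
    "a10g-small": 1.00,
    "a10g-large": 2.00,
}


def idle_cost(hardware: str, idle_hours: float) -> float:
    """Estimated dollars billed while a sandbox sits idle."""
    return round(HOURLY_RATE_USD[hardware] * idle_hours, 2)


# The example from the text: a $1/hr GPU sandbox left empty for 3 hours.
print(idle_cost("a10g-small", 3))
```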

---

## Repo File Guide

| File | What It Is |
|---|---|
| `session_state.json` | Current phase, checkpoint, mistakes log, next steps |
| `AGENTS.md` | This file: lessons and context |
| `docs/ae.md` | Full project documentation (research, design, results) |
| `phase1_final.zip` | Complete Phase 1 model |
| `phase2_ckpt_*.zip` | Phase 2 intermediate checkpoints |
| `phase2_final.zip` | Phase 2 complete model (when done) |
| `ae_manager.py` | Inference code for the evaluation server |
| `phase2_job.py` | Latest HF Job script (may need fixes) |
| `train_all_phases.py` | Original training script |

---

## Current State (Last Updated: 2026-05-14)

- **Phase**: 2 (in progress, crashed)
- **Latest checkpoint**: `phase2_ckpt_600352.zip`
- **Target**: 1,000,352 total steps (400,000 remaining)
- **Blocker**: Need to resume via HF Job with `snapshot_download` auth
- **Smoke test needed**: YES, before any multi-hour job

---

## Contact / Ownership

- **User**: E-Rong
- **Org**: E-Rong
- **Billing**: Must specify `namespace="E-Rong"` on all HF Jobs
- **I am**: An ephemeral agent with no persistent memory. I depend on this file.

---

*If you are a future version of me reading this: you have no memory of the above. Do not trust your internal knowledge. Read the files, check the state, test before committing compute.*