# AGENTS.md: Context & Lessons for Future Sessions

> This file exists because sandboxes reset and I (the agent) lose all memory. 
> **READ THIS FIRST** before doing anything on this project.

---

## What This Project Is

- **Challenge**: TIL-26-AE (The Intelligent League: Automated Exploration)
- **Game**: Multi-agent Bomberman on a 16×16 grid
- **My Role**: Train `agent_0` via RL to compete autonomously
- **Main Repo**: `E-Rong/til-26-ae-agent` (models, checkpoints, scripts, docs)
- **Space**: `e-rong/til-26-ae` (evaluation server with `ae/src/ae_manager.py`)
- **TIL Source**: Private Space `e-rong/til-26-ae`, which contains the `til_environment/` module

---

## CRITICAL: What Killed Training & Cost Money

### ❌ NEVER USE SANDBOXES FOR TRAINING > 30 MINUTES

Sandboxes are **interactive dev environments**. They:
- Recycle after inactivity / timeout
- Kill processes silently
- **Keep billing you while empty** after the process dies

**Damage done**: ~$4.87 wasted across 4 sandbox sessions where training died but billing continued.

### ✅ ALWAYS USE HF JOBS FOR BATCH TRAINING

- Persistent GPU allocation
- Runs until completion (or your timeout)
- Fails visibly if something breaks (no silent empty billing)
- Must set `namespace="E-Rong"` to bill the org, not the user

### ❌ NEVER `git clone` A PRIVATE REPO IN AN HF JOB

`git clone https://huggingface.co/spaces/...` fails because git does not read `HF_TOKEN`.

**Use instead**:
```python
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id='e-rong/til-26-ae',
    repo_type='space',
    local_dir='/app/til-26-ae-repo'
)
```
`snapshot_download` auto-uses the `HF_TOKEN` env var.

### ✅ ALWAYS SMOKE-TEST A JOB BEFORE THE FULL RUN

Submit a 5-minute job that:
1. Downloads the TIL repo
2. Installs deps
3. Runs 100 training steps
4. Saves a dummy checkpoint to the Hub

Only after this succeeds, submit the multi-hour job.

---

## Session Startup Checklist

Before doing **anything** on this project:

1. [ ] Read `session_state.json` from `E-Rong/til-26-ae-agent`
2. [ ] Read this file (`AGENTS.md`)
3. [ ] Check latest checkpoint on Hub (sort `phase*_ckpt_*.zip` files)
4. [ ] Determine current phase and remaining steps
5. [ ] If training needed: write script to sandbox, **smoke-test in HF Job first**
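
Steps 3 and 4 can be sketched as below. This is a minimal sketch, assuming checkpoint files follow the `phase<N>_ckpt_<steps>.zip` pattern (as in `phase2_ckpt_600352.zip`) and that the list of repo filenames has already been fetched. `latest_checkpoint` and `remaining_steps` are hypothetical helper names, and the file list is illustrative.

```python
import re
from typing import Iterable, Optional, Tuple

# Assumed naming scheme, inferred from files such as phase2_ckpt_600352.zip.
CKPT_PATTERN = re.compile(r"^phase(\d+)_ckpt_(\d+)\.zip$")


def latest_checkpoint(filenames: Iterable[str]) -> Optional[Tuple[int, int, str]]:
    """Return (phase, steps, filename) for the most advanced checkpoint, or None."""
    found = []
    for name in filenames:
        match = CKPT_PATTERN.match(name)
        if match:
            found.append((int(match.group(1)), int(match.group(2)), name))
    # Tuples compare by phase first, then by step count (numerically).
    return max(found) if found else None


def remaining_steps(target_total: int, steps_done: int) -> int:
    """Steps still to train; never negative."""
    return max(target_total - steps_done, 0)


# Illustrative file list, not the real repo contents.
files = ["AGENTS.md", "phase1_final.zip", "phase2_ckpt_99999.zip", "phase2_ckpt_600352.zip"]
phase, steps, name = latest_checkpoint(files)
print(name, remaining_steps(1_000_352, steps))
```

Parsing the step count as an integer matters: a plain string sort would rank `phase2_ckpt_99999.zip` after `phase2_ckpt_600352.zip`.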

---

## Technical Decisions That Work

### MaskablePPO + Action Masking
- `sb3_contrib.MaskablePPO` with `ActionMasker`
- Bomberman has `action_mask: uint8[6]`; walls/edges make moves illegal
- Standard PPO wastes ~30-40% of samples on illegal actions early on
- **Paper**: Huang & Ontañón, "A Closer Look at Invalid Action Masking in Policy Gradient Algorithms" (arXiv:2006.14171)

### Observation Flattening
1511-dim vector from dict observation:
```
agent_viewcone:  7×5×25 = 875
base_viewcone:   5×5×25 = 625
direction, location[2], base_location[2], health, frozen_ticks,
base_health, team_resources, team_bombs, step = 11 scalars
Total: 1511
```
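
The arithmetic above can be checked with a short script. This is a sketch only: the shapes and scalar fields are copied from the breakdown, while the dictionary names and `flat_obs_dim` are hypothetical and not part of `til_environment`.

```python
from math import prod

# Shapes and field widths copied from the breakdown above.
VIEWCONE_SHAPES = {
    "agent_viewcone": (7, 5, 25),
    "base_viewcone": (5, 5, 25),
}
SCALAR_WIDTHS = {
    "direction": 1,
    "location": 2,
    "base_location": 2,
    "health": 1,
    "frozen_ticks": 1,
    "base_health": 1,
    "team_resources": 1,
    "team_bombs": 1,
    "step": 1,
}


def flat_obs_dim() -> int:
    """Total length of the flattened observation vector."""
    viewcone_total = sum(prod(shape) for shape in VIEWCONE_SHAPES.values())
    scalar_total = sum(SCALAR_WIDTHS.values())
    return viewcone_total + scalar_total


print(flat_obs_dim())  # 875 + 625 + 11 = 1511
```

If the environment ever changes a viewcone shape, the policy's input size changes with it, so a check like this is worth running before loading an old checkpoint.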

### Wrapper Order (CRITICAL)
```python
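from sb3_contrib.common.wrappers import ActionMasker
from stable_baselines3.common.monitor import Monitor

# CORRECT: wrap with ActionMasker first, then Monitor
env = ActionMasker(base_env, lambda e: e.action_masks())
env = Monitor(env)

# WRONG: Monitor blocks action_masks() exposure
env = ActionMasker(Monitor(base_env), ...)  # DON'T DO THIS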
```

### 3-Phase Curriculum
| Phase | Opponent | Duration (steps) | Purpose |
|---|---|---|---|
| 1 | Random | 500k | Learn basics |
| 2 | Random + visit-count shaping | 500k | Prevent camping |
| 3 | Rule-based curriculum | 1M | Generalize to structured opponents |

### Checkpointing Every 50k Steps
- Local + Hub push via `HfApi.upload_file()`
- Saved the project when sandboxes reset at 400k and 600k steps
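
The cadence can be sketched as follows. This is a sketch under assumptions: the step counter advances in rollout-sized chunks (which would explain names like `phase2_ckpt_600352.zip` not landing on exact multiples of 50k), and `save` and `upload` are caller-supplied callables, for example a model's save method and a thin wrapper around `HfApi.upload_file()`. All function names here are hypothetical, and the step values in the usage line are illustrative.

```python
from typing import Callable, Optional


def crossed_checkpoint(prev_steps: int, cur_steps: int, every: int = 50_000) -> bool:
    """True when training moved past a multiple of `every` since the last check."""
    return cur_steps // every > prev_steps // every


def checkpoint_name(phase: int, steps: int) -> str:
    """Filename in the phase<N>_ckpt_<steps>.zip style used in this repo."""
    return f"phase{phase}_ckpt_{steps}.zip"


def maybe_checkpoint(
    phase: int,
    prev_steps: int,
    cur_steps: int,
    save: Callable[[str], None],
    upload: Callable[[str], None],
    every: int = 50_000,
) -> Optional[str]:
    """Save locally and push to the Hub if a checkpoint boundary was crossed."""
    if not crossed_checkpoint(prev_steps, cur_steps, every):
        return None
    name = checkpoint_name(phase, cur_steps)
    save(name)    # local copy first
    upload(name)  # then push, so a reset cannot lose it
    return name


# Stand-ins for the real save/upload calls: just record what was requested.
saved, pushed = [], []
print(maybe_checkpoint(2, 598_304, 600_352, saved.append, pushed.append))
```

Uploading in the same call as the local save is the point: a checkpoint that exists only on job or sandbox disk is gone after a reset.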

---

## Technical Decisions That Failed

| Decision | Why It Failed | Fix |
|---|---|---|
| Training in sandboxes | Process died, empty sandbox kept billing | Use HF Jobs |
| `git clone` in HF Job | No auth for private repo | `snapshot_download` |
| Inline 20KB script in `hf_jobs.script` | Delivery mechanism choked | Write to sandbox file first, submit path |
| No session state on Hub | Lost track of progress across resets | `session_state.json` + this file |
| `Monitor` inside `ActionMasker` | `get_action_masks()` failed | `ActionMasker` → `Monitor` order |

---

## Cost Awareness

| Hardware | $/hr | Good For |
|---|---|---|
| `cpu-basic` | ~$0.05 | Writing scripts, reading files, small tests |
| `t4-small` | ~$0.40 | Short dev, NOT training |
| `a10g-small` | ~$1.00 | Training, but use HF Jobs not sandboxes |
| `a10g-large` | ~$2.00 | Larger batch sizes, not needed for this project |

**Rule**: If a task takes >30 min, it must be an HF Job. Sandboxes are for editing and quick tests only.
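
A rough budget check before starting anything, as a sketch: the rates are the approximate figures from the table above, and `estimated_cost` and `where_to_run` are hypothetical helper names.

```python
# Approximate hourly rates in USD, copied from the table above.
HOURLY_RATE_USD = {
    "cpu-basic": 0.05,
    "t4-small": 0.40,
    "a10g-small": 1.00,
    "a10g-large": 2.00,
}

# Tasks expected to run longer than this must be HF Jobs.
SANDBOX_LIMIT_MINUTES = 30


def estimated_cost(flavor: str, hours: float) -> float:
    """Approximate cost in USD for running `flavor` for `hours`."""
    return round(HOURLY_RATE_USD[flavor] * hours, 2)


def where_to_run(expected_minutes: float) -> str:
    """Apply the 30-minute rule: long tasks go to HF Jobs."""
    return "hf_job" if expected_minutes > SANDBOX_LIMIT_MINUTES else "sandbox"


print(estimated_cost("a10g-small", 6), where_to_run(6 * 60))
```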

---

## Sandbox Policy (User Mandate)

> **From this point forward, the user has mandated:**

1. **Start `cpu-basic` sandbox** at the beginning of every session
2. **Use `cpu-basic` for**: context, writing code, writing docs, editing files, planning
3. **Only switch to GPU sandbox** (`t4-small` or `a10g-small`) when performing **smoke tests** for training scripts
4. **Stop GPU sandbox IMMEDIATELY** after the smoke test completes
5. **Training tasks ONLY as HF Jobs**: never leave a training process running in a sandbox
6. **Never leave a GPU sandbox running idle**: this wastes money

**Why this matters**: A GPU sandbox at $1/hr running empty for 3 hours = $3 wasted for nothing. An HF Job at the same $1/hr actually trains for every billed minute.

---

## How to Submit HF Jobs Correctly (Research Results)

### Based on `huggingface.co/docs/hub/jobs-quickstart`:

**DO NOT use `git clone` for private repos.**

```python
# WRONG ❌
import subprocess
subprocess.run(["git", "clone", "https://huggingface.co/spaces/e-rong/til-26-ae"])
# Fails: git does not read HF_TOKEN env var

# CORRECT ✅
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="e-rong/til-26-ae",
    repo_type="space",
    local_dir="/app/til-26-ae-repo"
)
# snapshot_download auto-uses HF_TOKEN from environment
```

### Script Submission Pattern (What Actually Works)

**⚠️ CRITICAL DISCOVERY: The `script` parameter in `hf_jobs` becomes a RAW HUB URL.**

When you call `hf_jobs(script="/app/train.py")`, the job system does NOT upload the local file. Instead, it converts the path to:
```
https://huggingface.co/E-Rong/til-26-ae-agent/raw/main/train.py
```
and runs it via `uv run <url>`. **This means the file MUST already exist on the Hub repo.**
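
The mapping can be reproduced to predict which URL a job will fetch. This is a sketch under a loud assumption: only the file name of the sandbox path is kept and looked up at the root of the repo's `main` branch, which is what the single example above suggests. The job system's real derivation logic is not documented here, and `raw_hub_url` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def raw_hub_url(repo_id: str, script_path: str, revision: str = "main") -> str:
    """Guess the raw Hub URL a job would fetch for a given sandbox script path.

    Assumption: only the file name is kept and it is resolved at the repo root.
    """
    filename = PurePosixPath(script_path).name
    return f"https://huggingface.co/{repo_id}/raw/{revision}/{filename}"


print(raw_hub_url("E-Rong/til-26-ae-agent", "/app/train.py"))
# https://huggingface.co/E-Rong/til-26-ae-agent/raw/main/train.py
```

Checking that this URL resolves before submitting is cheaper than discovering a 404 inside a billed job.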

**The correct workflow is:**

```python
from tools import write, hf_repo_files, hf_jobs

# Step 1: Write script to sandbox file
write(path="/app/train.py", content="...")

# Step 2: ALSO upload to Hub repo so it's persisted and URL-accessible
hf_repo_files(
    operation="upload",
    repo_id="E-Rong/til-26-ae-agent",
    path="train.py",
    content=open("/app/train.py").read()
)

# Step 3: Submit job referencing the sandbox path
# The job system will convert this to a Hub raw URL under the hood
hf_jobs(
    operation="run",
    script="/app/train.py",           # ← sandbox file path
    dependencies=["torch", "sb3-contrib", "gymnasium", "pettingzoo",
                  "numpy", "huggingface_hub", "pygame", "omegaconf",
                  "mazelib", "imageio", "imageio-ffmpeg", "supersuit", "psutil"],
    hardware_flavor="a10g-small",
    timeout="6h",
    namespace="E-Rong"              # ← bills to org
)
```

**Verification from `hf_jobs inspect`:**
```bash
exec uv run --with torch --with sb3-contrib ... \
    https://huggingface.co/E-Rong/til-26-ae-agent/raw/main/phase2_resume.py
```
The job fetches the script from the Hub, not from the sandbox. The sandbox path is just used to derive the repo/file path.

**Why this matters**: If you only write to `/app/train.py` and don't upload to the Hub, the job will fail with a 404 when it tries to fetch the URL. The sandbox resets, but the Hub URL is permanent.

### Job Persistence
- Jobs run on HF infrastructure, not in your sandbox
- The sandbox can die; the job keeps running
- Check logs with `hf_jobs(operation="logs", job_id="...")`
- Job storage is ephemeral, so **push checkpoints to Hub** (not just local)

---

## Repo File Guide

| File | What It Is |
|---|---|
| `session_state.json` | Current phase, checkpoint, mistakes log, next steps |
| `AGENTS.md` | This file: lessons and context |
| `docs/ae.md` | Full project documentation (research, design, results) |
| `phase1_final.zip` | Complete Phase 1 model |
| `phase2_ckpt_*.zip` | Phase 2 intermediate checkpoints |
| `phase2_final.zip` | Phase 2 complete model (when done) |
| `ae_manager.py` | Inference code for the evaluation server |
| `phase2_resume.py` | Latest HF Job script (works; uses `snapshot_download`) |
| `smoke_test.py` | 5-minute validation job; run it before any real job |
| `train_all_phases.py` | Original training script |

---

## Current State (Last Updated: 2026-05-14)

- **Phase**: 2 (in progress, crashed)
- **Latest checkpoint**: `phase2_ckpt_600352.zip`
- **Target**: 1,000,352 total steps (400,000 remaining)
- **Blocker**: Need to resume via HF Job with `snapshot_download` auth
- **Smoke test needed**: YES, before any multi-hour job

---

## Contact / Ownership

- **User**: E-Rong
- **Org**: E-Rong
- **Billing**: Must specify `namespace="E-Rong"` on all HF Jobs
- **I am**: An ephemeral agent with no persistent memory. I depend on this file.

---

*If you are a future version of me reading this: you have no memory of the above. Do not trust your internal knowledge. Read the files, check the state, test before committing compute.*