# AGENTS.md: Context & Lessons for Future Sessions
> This file exists because sandboxes reset and I (the agent) lose all memory.
> **READ THIS FIRST** before doing anything on this project.
---
## What This Project Is
- **Challenge**: TIL-26-AE (The Intelligent League: Automated Exploration)
- **Game**: Multi-agent Bomberman on a 16×16 grid
- **My Role**: Train `agent_0` via RL to compete autonomously
- **Main Repo**: `E-Rong/til-26-ae-agent` (models, checkpoints, scripts, docs)
- **Space**: `e-rong/til-26-ae` (evaluation server with `ae/src/ae_manager.py`)
- **TIL Source**: Private Space `e-rong/til-26-ae`, which contains the `til_environment/` module
---
## CRITICAL: What Killed Training & Cost Money
### ❌ NEVER USE SANDBOXES FOR TRAINING > 30 MINUTES
Sandboxes are **interactive dev environments**. They:
- Recycle after inactivity / timeout
- Kill processes silently
- **Keep billing you while empty** after the process dies
**Damage done**: ~$4.87 wasted across 4 sandbox sessions where training died but billing continued.
### ✅ ALWAYS USE HF JOBS FOR BATCH TRAINING
- Persistent GPU allocation
- Runs until completion (or your timeout)
- Fails visibly if something breaks (no silent empty billing)
- Must set `namespace="E-Rong"` to bill the org, not the user
### ❌ NEVER `git clone` A PRIVATE REPO IN AN HF JOB
`git clone https://huggingface.co/spaces/...` fails because git does not read `HF_TOKEN`.
**Use instead**:
```python
from huggingface_hub import snapshot_download
snapshot_download(
repo_id='e-rong/til-26-ae',
repo_type='space',
local_dir='/app/til-26-ae-repo'
)
```
`snapshot_download` auto-uses the `HF_TOKEN` env var.
### ✅ ALWAYS SMOKE-TEST A JOB BEFORE THE FULL RUN
Submit a 5-minute job that:
1. Downloads the TIL repo
2. Installs deps
3. Runs 100 training steps
4. Saves a dummy checkpoint to the Hub
Only after this succeeds, submit the multi-hour job.
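
The four checks above can be wrapped in a fail-fast runner. This is a minimal sketch, not the contents of `smoke_test.py`: `run_smoke_test` and `smoke_passed` are hypothetical helpers, and the real steps would be callables that download the repo, install deps, train for 100 steps, and upload a checkpoint.

```python
import time
from typing import Callable


def run_smoke_test(steps: list[tuple[str, Callable[[], None]]]) -> list[tuple[str, bool, float]]:
    """Run each named step in order and stop at the first failure.

    Returns one (name, succeeded, seconds) tuple per step that was attempted.
    """
    results = []
    for name, step in steps:
        start = time.monotonic()
        try:
            step()
            ok = True
        except Exception as exc:  # any failure should abort the smoke test
            print(f"[smoke] {name} FAILED: {exc!r}")
            ok = False
        results.append((name, ok, time.monotonic() - start))
        if not ok:
            break  # later steps depend on earlier ones, so do not continue
    return results


def smoke_passed(results: list[tuple[str, bool, float]], expected_steps: int) -> bool:
    """True only if every expected step ran and succeeded."""
    return len(results) == expected_steps and all(ok for _, ok, _ in results)
```

Submit the multi-hour job only when `smoke_passed(results, 4)` is true.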
---
## Session Startup Checklist
Before doing **anything** on this project:
1. [ ] Read `session_state.json` from `E-Rong/til-26-ae-agent`
2. [ ] Read this file (`AGENTS.md`)
3. [ ] Check latest checkpoint on Hub (sort `phase*_ckpt_*.zip` files)
4. [ ] Determine current phase and remaining steps
5. [ ] If training needed: write script to sandbox, **smoke-test in HF Job first**
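
Steps 3 and 4 can be sketched as follows, assuming the list of filenames has already been fetched from the Hub (for example with `HfApi().list_repo_files`). `latest_checkpoint` and `remaining_steps` are hypothetical helpers, and `phase2_ckpt_99000.zip` is a made-up filename used only to show why the sort must be numeric.

```python
import re

# Matches names like "phase2_ckpt_600352.zip"
CKPT_PATTERN = re.compile(r"^phase(\d+)_ckpt_(\d+)\.zip$")


def latest_checkpoint(filenames):
    """Return (phase, step, filename) for the newest checkpoint, or None."""
    found = []
    for name in filenames:
        match = CKPT_PATTERN.match(name)
        if match:
            found.append((int(match.group(1)), int(match.group(2)), name))
    # Compare numerically: a plain string sort would rank "99000" above "600352".
    return max(found) if found else None


def remaining_steps(current_step, target_step):
    """Steps still to train, never negative."""
    return max(target_step - current_step, 0)


files = ["phase1_final.zip", "phase2_ckpt_99000.zip", "phase2_ckpt_600352.zip"]
phase, step, name = latest_checkpoint(files)
print(name, remaining_steps(step, 1_000_352))
```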
---
## Technical Decisions That Work
### MaskablePPO + Action Masking
- `sb3_contrib.MaskablePPO` with `ActionMasker`
- Bomberman has `action_mask: uint8[6]`; walls/edges make moves illegal
- Standard PPO wastes ~30-40% of samples on illegal actions early on
- **Paper**: Huang & Ontañón, "A Closer Look at Invalid Action Masking in Policy Gradient Algorithms" (arXiv:2006.14171)
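
To show what masking buys, here is a pure-Python sketch of a masked softmax over the 6 actions. It illustrates the idea only and is not the `sb3_contrib` implementation; `masked_action_probs` is a hypothetical name.

```python
import math


def masked_action_probs(logits, action_mask):
    """Softmax over legal actions only; illegal actions get probability 0."""
    legal = [logit for logit, allowed in zip(logits, action_mask) if allowed]
    if not legal:
        raise ValueError("action_mask must allow at least one action")
    peak = max(legal)  # subtract the max for numerical stability
    exps = [
        math.exp(logit - peak) if allowed else 0.0
        for logit, allowed in zip(logits, action_mask)
    ]
    total = sum(exps)
    return [e / total for e in exps]


# Actions 2 and 3 are blocked (e.g. by a wall), so they can never be sampled.
probs = masked_action_probs([0.0] * 6, [1, 1, 0, 0, 1, 1])
```

Because illegal actions have zero probability, no rollout samples are spent on them.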
### Observation Flattening
1511-dim vector from dict observation:
```
agent_viewcone: 7×5×25 = 875
base_viewcone: 5×5×25 = 625
direction, location[2], base_location[2], health, frozen_ticks,
base_health, team_resources, team_bombs, step = 11 scalars
Total: 1511
```
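
The arithmetic above can be checked in code. This sketch assumes the observation fields arrive as nested lists or scalars and that they are concatenated in the order listed. Both are assumptions; with NumPy arrays you would ravel and concatenate instead. `OBS_LAYOUT`, `flat_dim`, and `flatten_obs` are hypothetical names.

```python
from math import prod

# Shapes copied from the breakdown above; the key order is an assumption.
OBS_LAYOUT = {
    "agent_viewcone": (7, 5, 25),
    "base_viewcone": (5, 5, 25),
    "direction": (1,),
    "location": (2,),
    "base_location": (2,),
    "health": (1,),
    "frozen_ticks": (1,),
    "base_health": (1,),
    "team_resources": (1,),
    "team_bombs": (1,),
    "step": (1,),
}


def flat_dim(layout):
    """Total length of the flattened observation vector."""
    return sum(prod(shape) for shape in layout.values())


def flatten(value):
    """Recursively flatten nested lists/tuples (or a scalar) into floats."""
    if isinstance(value, (list, tuple)):
        return [x for item in value for x in flatten(item)]
    return [float(value)]


def flatten_obs(obs, layout=OBS_LAYOUT):
    """Concatenate fields in layout order, checking each field's size."""
    out = []
    for key, shape in layout.items():
        part = flatten(obs[key])
        if len(part) != prod(shape):
            raise ValueError(f"{key}: expected {prod(shape)} values, got {len(part)}")
        out.extend(part)
    return out
```

The per-field size check catches an environment update that changes a viewcone shape before it silently corrupts the policy input.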
### Wrapper Order (CRITICAL)
```python
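from sb3_contrib.common.wrappers import ActionMasker
from stable_baselines3.common.monitor import Monitor

# CORRECT: ActionMasker wraps the base env first, Monitor goes outermost
# (base_env is assumed to be the TIL environment instance, created elsewhere)
env = ActionMasker(base_env, lambda e: e.action_masks())
env = Monitor(env)

# WRONG: Monitor blocks action_masks() exposure
# env = ActionMasker(Monitor(base_env), ...)  # DON'T DO THIS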
```
### 3-Phase Curriculum
| Phase | Opponent | Duration (steps) | Purpose |
|---|---|---|---|
| 1 | Random | 500k | Learn basics |
| 2 | Random + visit-count shaping | 500k | Prevent camping |
| 3 | Rule-based curriculum | 1M | Generalize to structured opponents |
### Checkpointing Every 50k Steps
- Local + Hub push via `HfApi.upload_file()`
- Saved the project when sandboxes reset at 400k and 600k steps
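
A minimal sketch of the checkpoint naming and Hub push, assuming the main repo is a model repo (`repo_type="model"` is an assumption) and that `HF_TOKEN` is set in the environment. The helper names are hypothetical; only `HfApi.upload_file` comes from `huggingface_hub`.

```python
def checkpoint_name(phase: int, step: int) -> str:
    """Build a filename that matches the phase*_ckpt_*.zip convention."""
    return f"phase{phase}_ckpt_{step}.zip"


def is_checkpoint_due(step: int, last_saved_step: int, every: int = 50_000) -> bool:
    """True once at least `every` steps have passed since the last save."""
    return step - last_saved_step >= every


def push_checkpoint(local_path: str, phase: int, step: int,
                    repo_id: str = "E-Rong/til-26-ae-agent") -> str:
    """Upload a saved checkpoint file to the Hub and return its repo path."""
    # Imported lazily so the helpers above work without huggingface_hub installed.
    from huggingface_hub import HfApi

    path_in_repo = checkpoint_name(phase, step)
    HfApi().upload_file(
        path_or_fileobj=local_path,
        path_in_repo=path_in_repo,
        repo_id=repo_id,
        repo_type="model",  # assumption: the main repo is a model repo
    )
    return path_in_repo
```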
---
## Technical Decisions That Failed
| Decision | Why It Failed | Fix |
|---|---|---|
| Training in sandboxes | Process died, empty sandbox kept billing | Use HF Jobs |
| `git clone` in HF Job | No auth for private repo | `snapshot_download` |
| Inline 20KB script in `hf_jobs.script` | Delivery mechanism choked | Write to a sandbox file, upload it to the Hub repo, then submit the path |
| No session state on Hub | Lost track of progress across resets | `session_state.json` + this file |
| `Monitor` inside `ActionMasker` | `get_action_masks()` failed | `ActionMasker` → `Monitor` order |
---
## Cost Awareness
| Hardware | $/hr | Good For |
|---|---|---|
| `cpu-basic` | ~$0.05 | Writing scripts, reading files, small tests |
| `t4-small` | ~$0.40 | Short dev, NOT training |
| `a10g-small` | ~$1.00 | Training, but use HF Jobs not sandboxes |
| `a10g-large` | ~$2.00 | Larger batch sizes, not needed for this project |
**Rule**: If a task takes >30 min, it must be an HF Job. Sandboxes are for editing and quick tests only.
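
The rule and the table can be turned into a quick pre-flight estimate. This is a sketch using the approximate rates from the table; `estimate_cost` and `choose_runner` are hypothetical helpers.

```python
# Approximate hourly rates (USD) copied from the table above.
HOURLY_RATE_USD = {
    "cpu-basic": 0.05,
    "t4-small": 0.40,
    "a10g-small": 1.00,
    "a10g-large": 2.00,
}


def estimate_cost(hardware: str, minutes: float) -> float:
    """Estimated cost in USD for running `hardware` for `minutes`."""
    return round(HOURLY_RATE_USD[hardware] * minutes / 60, 2)


def choose_runner(expected_minutes: float) -> str:
    """Apply the 30-minute rule: longer tasks must be HF Jobs."""
    return "hf_job" if expected_minutes > 30 else "sandbox"


# A 6-hour training run on a10g-small
print(choose_runner(360), estimate_cost("a10g-small", 360))
```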
---
## Sandbox Policy (User Mandate)
> **From this point forward, the user has mandated:**
1. **Start `cpu-basic` sandbox** at the beginning of every session
2. **Use `cpu-basic` for**: context, writing code, writing docs, editing files, planning
3. **Only switch to GPU sandbox** (`t4-small` or `a10g-small`) when performing **smoke tests** for training scripts
4. **Stop GPU sandbox IMMEDIATELY** after the smoke test completes
5. **Training tasks ONLY as HF Jobs**: never leave a training process running in a sandbox
6. **Never leave a GPU sandbox running idle**: this wastes money
**Why this matters**: A GPU sandbox at $1/hr running empty for 3 hours = $3 wasted for nothing. An HF Job at the same $1/hr actually trains for every billed minute.
---
## How to Submit HF Jobs Correctly (Research Results)
### Based on `huggingface.co/docs/hub/jobs-quickstart`:
**DO NOT use `git clone` for private repos.**
```python
# WRONG ❌
import subprocess
subprocess.run(["git", "clone", "https://huggingface.co/spaces/e-rong/til-26-ae"])
# Fails: git does not read HF_TOKEN env var
# CORRECT ✅
from huggingface_hub import snapshot_download
snapshot_download(
repo_id="e-rong/til-26-ae",
repo_type="space",
local_dir="/app/til-26-ae-repo"
)
# snapshot_download auto-uses HF_TOKEN from environment
```
### Script Submission Pattern (What Actually Works)
**⚠️ CRITICAL DISCOVERY: The `script` parameter in `hf_jobs` becomes a RAW HUB URL.**
When you call `hf_jobs(script="/app/train.py")`, the job system does NOT upload the local file. Instead, it converts the path to:
```
https://huggingface.co/E-Rong/til-26-ae-agent/raw/main/train.py
```
and runs it via `uv run <url>`. **This means the file MUST already exist on the Hub repo.**
**The correct workflow is:**
```python
from tools import write, hf_repo_files, hf_jobs
# Step 1: Write script to sandbox file
write(path="/app/train.py", content="...")
# Step 2: ALSO upload to Hub repo so it's persisted and URL-accessible
hf_repo_files(
operation="upload",
repo_id="E-Rong/til-26-ae-agent",
path="train.py",
content=open("/app/train.py").read()
)
# Step 3: Submit job referencing the sandbox path
# The job system will convert this to a Hub raw URL under the hood
hf_jobs(
operation="run",
script="/app/train.py", # ← sandbox file path
dependencies=["torch", "sb3-contrib", "gymnasium", "pettingzoo",
"numpy", "huggingface_hub", "pygame", "omegaconf",
"mazelib", "imageio", "imageio-ffmpeg", "supersuit", "psutil"],
hardware_flavor="a10g-small",
timeout="6h",
namespace="E-Rong" # ← bills to org
)
```
**Verification from `hf_jobs inspect`:**
```bash
exec uv run --with torch --with sb3-contrib ... \
https://huggingface.co/E-Rong/til-26-ae-agent/raw/main/phase2_resume.py
```
The job fetches the script from the Hub, not from the sandbox. The sandbox path is just used to derive the repo/file path.
**Why this matters**: If you only write to `/app/train.py` and don't upload to the Hub, the job will fail with a 404 when it tries to fetch the URL. The sandbox resets, but the Hub URL is permanent.
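
A pre-flight check can catch the 404 before any compute is billed. This is a sketch under one assumption: the job system maps the sandbox path to the Hub by basename, which is inferred from the two URLs observed here and is not documented behavior. `raw_hub_url` and `script_is_on_hub` are hypothetical helpers; `HfApi.file_exists` is the `huggingface_hub` call.

```python
import posixpath


def raw_hub_url(repo_id: str, script_path: str, revision: str = "main") -> str:
    """Reconstruct the raw Hub URL the job is expected to fetch."""
    # Assumption: only the file's basename is kept, as seen with /app/train.py.
    filename = posixpath.basename(script_path)
    return f"https://huggingface.co/{repo_id}/raw/{revision}/{filename}"


def script_is_on_hub(repo_id: str, script_path: str) -> bool:
    """Check that the script was uploaded before submitting the job."""
    # Imported lazily so raw_hub_url works without huggingface_hub installed.
    from huggingface_hub import HfApi

    return HfApi().file_exists(
        repo_id=repo_id,
        filename=posixpath.basename(script_path),
    )


print(raw_hub_url("E-Rong/til-26-ae-agent", "/app/train.py"))
```

If `script_is_on_hub(...)` returns `False`, upload the script first, then submit.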
### Job Persistence
- Jobs run on HF infrastructure, not in your sandbox
- The sandbox can die; the job keeps running
- Check logs with `hf_jobs(operation="logs", job_id="...")`
- Job storage is ephemeral, so **push checkpoints to the Hub** (not just local)
---
## Repo File Guide
| File | What It Is |
|---|---|
| `session_state.json` | Current phase, checkpoint, mistakes log, next steps |
| `AGENTS.md` | This file: lessons and context |
| `docs/ae.md` | Full project documentation (research, design, results) |
| `phase1_final.zip` | Complete Phase 1 model |
| `phase2_ckpt_*.zip` | Phase 2 intermediate checkpoints |
| `phase2_final.zip` | Phase 2 complete model (when done) |
| `ae_manager.py` | Inference code for the evaluation server |
| `phase2_resume.py` | Latest HF Job script (works; uses `snapshot_download`) |
| `smoke_test.py` | 5-minute validation job; run it before any real job |
| `train_all_phases.py` | Original training script |
---
## Current State (Last Updated: 2026-05-14)
- **Phase**: 2 (in progress, crashed)
- **Latest checkpoint**: `phase2_ckpt_600352.zip`
- **Target**: 1,000,352 total steps (400,000 remaining)
- **Blocker**: Need to resume via HF Job with `snapshot_download` auth
- **Smoke test needed**: YES, before any multi-hour job
---
## Contact / Ownership
- **User**: E-Rong
- **Org**: E-Rong
- **Billing**: Must specify `namespace="E-Rong"` on all HF Jobs
- **I am**: An ephemeral agent with no persistent memory. I depend on this file.
---
*If you are a future version of me reading this: you have no memory of the above. Do not trust your internal knowledge. Read the files, check the state, test before committing compute.*