	# AGENTS.md — Context & Lessons for Future Sessions

	> This file exists because sandboxes reset and I (the agent) lose all memory.
	> READ THIS FIRST before doing anything on this project.

	---

	## What This Project Is

	- Challenge: TIL-26-AE (The Intelligent League — Automated Exploration)
	- Game: Multi-agent Bomberman on a 16×16 grid
	- My Role: Train `agent_0` via RL to compete autonomously
	- Main Repo: `E-Rong/til-26-ae-agent` (models, checkpoints, scripts, docs)
	- Space: `e-rong/til-26-ae` (evaluation server with `ae/src/ae_manager.py`)
	- TIL Source: Private Space `e-rong/til-26-ae` — contains `til_environment/` module

	---

	## CRITICAL: What Killed Training & Cost Money

	### ❌ NEVER USE SANDBOXES FOR TRAINING > 30 MINUTES

	Sandboxes are interactive dev environments. They:
	- Recycle after inactivity / timeout
	- Kill processes silently
	- Keep billing after the process dies, even though nothing is running

	Damage done: ~$4.87 wasted across 4 sandbox sessions where training died but billing continued.

	### ✅ ALWAYS USE HF JOBS FOR BATCH TRAINING

	- Persistent GPU allocation
	- Runs until completion (or your timeout)
	- Fails visibly if something breaks (no silent empty billing)
	- Must set `namespace="E-Rong"` to bill the org, not the user
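
The namespace rule is easy to forget, so it can help to assemble the job arguments in one place. The following is a minimal sketch, assuming the job is submitted through a Python call that accepts `command`, `flavor`, `timeout`, and `namespace` keyword arguments (recent `huggingface_hub` releases document a `run_job` function of roughly this shape, but verify against the installed version). `build_job_kwargs` and `ORG_NAMESPACE` are hypothetical names, and the helper only builds a dict; it never contacts the Hub.

```python
ORG_NAMESPACE = "E-Rong"  # org that should be billed, per the rule above


def build_job_kwargs(command, flavor="a10g-small", timeout="4h"):
    """Assemble keyword arguments for a job submission call.

    Hypothetical helper: the default flavor and timeout are illustrative.
    """
    if not command:
        raise ValueError("command must not be empty")
    return {
        "command": list(command),
        "flavor": flavor,
        "timeout": timeout,
        "namespace": ORG_NAMESPACE,  # always bill the org, never the user
    }


kwargs = build_job_kwargs(["python", "phase2_job.py"])
print(kwargs["namespace"])
```

Routing every submission through one helper means the namespace cannot be dropped by accident when a job is written in a hurry.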

	### ❌ NEVER `git clone` A PRIVATE REPO IN AN HF JOB

	`git clone https://huggingface.co/spaces/...` fails because git does not read `HF_TOKEN`.

	Use instead:
	```python
	from huggingface_hub import snapshot_download

	# Download the private Space into a local directory.
	snapshot_download(
	    repo_id='e-rong/til-26-ae',
	    repo_type='space',
	    local_dir='/app/til-26-ae-repo',
	)
	```
	`snapshot_download` auto-uses the `HF_TOKEN` env var.

	### ✅ ALWAYS SMOKE-TEST A JOB BEFORE THE FULL RUN

	Submit a 5-minute job that:
	1. Downloads the TIL repo
	2. Installs deps
	3. Runs 100 training steps
	4. Saves a dummy checkpoint to the Hub

	Only after this succeeds, submit the multi-hour job.

	---

	## Session Startup Checklist

	Before doing anything on this project:

	1. [ ] Read `session_state.json` from `E-Rong/til-26-ae-agent`
	2. [ ] Read this file (`AGENTS.md`)
	3. [ ] Check latest checkpoint on Hub (sort `phase*_ckpt_*.zip` files by step count)
	4. [ ] Determine current phase and remaining steps
	5. [ ] If training needed: write script to sandbox, smoke-test in HF Job first
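
Step 3 of the checklist can be sketched as follows, assuming checkpoint names follow the `phaseN_ckpt_STEPS.zip` pattern seen elsewhere in this file. The file list here is hard-coded for illustration (only `phase2_ckpt_600352.zip` and `phase1_final.zip` are real names from this document); in practice it would come from a Hub listing call such as `HfApi.list_repo_files`. `latest_checkpoint` is a hypothetical helper.

```python
import re

CKPT_PATTERN = re.compile(r"^phase(\d+)_ckpt_(\d+)\.zip$")


def latest_checkpoint(filenames):
    """Return (filename, phase, step) for the highest (phase, step), or None."""
    best = None
    for name in filenames:
        match = CKPT_PATTERN.match(name)
        if match is None:
            continue  # skip docs, scripts, and *_final.zip files
        phase, step = int(match.group(1)), int(match.group(2))
        if best is None or (phase, step) > (best[1], best[2]):
            best = (name, phase, step)
    return best


files = [
    "AGENTS.md",
    "phase1_final.zip",
    "phase2_ckpt_550352.zip",  # made-up name for the example
    "phase2_ckpt_600352.zip",
]
print(latest_checkpoint(files))
```

Comparing the parsed integers avoids the trap of plain string sorting, where a shorter step count such as `99999` would sort after `600352`.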

	---

	## Technical Decisions That Work

	### MaskablePPO + Action Masking
	- `sb3_contrib.MaskablePPO` with `ActionMasker`
	- Bomberman has `action_mask: uint8[6]` — walls/edges make moves illegal
	- Standard PPO wastes ~30-40% of its samples on illegal actions early in training
	- Paper: Huang & Ontañón, "A Closer Look at Invalid Action Masking in Policy Gradient Algorithms" (arxiv:2006.14171)
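
To make the idea concrete, here is a minimal pure-Python sketch of invalid action masking: logits of illegal actions are replaced with negative infinity before the softmax, so those actions receive probability zero. This illustrates the general technique only; it is not the `sb3_contrib` implementation, and `masked_softmax` plus the example logits are made up for the illustration.

```python
import math


def masked_softmax(logits, action_mask):
    """Softmax over logits where actions with mask 0 get probability 0."""
    if len(logits) != len(action_mask):
        raise ValueError("logits and action_mask must have the same length")
    if not any(action_mask):
        raise ValueError("at least one action must be legal")
    masked = [x if m else -math.inf for x, m in zip(logits, action_mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in masked]  # exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Six actions, of which actions 1 and 4 are illegal in this state.
probs = masked_softmax([0.5, 1.0, -0.2, 0.0, 0.3, 0.1], [1, 0, 1, 1, 0, 1])
print(probs)
```

Because masked actions can never be sampled, no rollout steps are spent discovering that walking into a wall does nothing.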

	### Observation Flattening
	1511-dim vector from dict observation:
	```
	agent_viewcone: 7×5×25 = 875
	base_viewcone: 5×5×25 = 625
	direction, location[2], base_location[2], health, frozen_ticks,
	base_health, team_resources, team_bombs, step = 11 scalars
	Total: 1511
	```
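
The layout above can be sketched with NumPy as follows. The dictionary keys and array shapes are taken from the listing, but the exact key names, dtypes, and field order in the real environment are assumptions, and `flatten_observation` is a hypothetical helper rather than the project's wrapper.

```python
import numpy as np

# Scalar fields in the order listed above; location fields have 2 entries each.
SCALAR_KEYS = [
    "direction", "location", "base_location", "health", "frozen_ticks",
    "base_health", "team_resources", "team_bombs", "step",
]


def flatten_observation(obs):
    """Concatenate both viewcones and the scalar fields into one float32 vector."""
    parts = [
        np.asarray(obs["agent_viewcone"], dtype=np.float32).ravel(),
        np.asarray(obs["base_viewcone"], dtype=np.float32).ravel(),
    ]
    for key in SCALAR_KEYS:
        # atleast_1d turns plain scalars into length-1 arrays
        parts.append(np.atleast_1d(np.asarray(obs[key], dtype=np.float32)).ravel())
    return np.concatenate(parts)


# Dummy observation with the listed shapes (values are placeholders).
dummy = {
    "agent_viewcone": np.zeros((7, 5, 25)),
    "base_viewcone": np.zeros((5, 5, 25)),
    "direction": 0, "location": [3, 4], "base_location": [0, 0],
    "health": 1, "frozen_ticks": 0, "base_health": 1,
    "team_resources": 0, "team_bombs": 0, "step": 0,
}
flat = flatten_observation(dummy)
print(flat.shape)
```

Checking the output length against 1511 at startup is a cheap guard against a silent change in the observation space.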

	### Wrapper Order (CRITICAL)
	```python
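	from sb3_contrib.common.wrappers import ActionMasker
	from stable_baselines3.common.monitor import Monitor

	# CORRECT: mask the raw env first, then wrap it with Monitor.
	# base_env is the Bomberman environment created elsewhere.
	env = ActionMasker(base_env, lambda e: e.action_masks())
	env = Monitor(env)

	# WRONG: Monitor blocks action_masks() exposure
	# env = ActionMasker(Monitor(base_env), ...)  # DON'T DO THIS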
	```

	### 3-Phase Curriculum
	| Phase | Opponent | Duration (steps) | Purpose |
	|---|---|---|---|
	| 1 | Random | 500k | Learn basics |
	| 2 | Random + visit-count shaping | 500k | Prevent camping |
	| 3 | Rule-based curriculum | 1M | Generalize to structured opponents |

	### Checkpointing Every 50k Steps
	- Local + Hub push via `HfApi.upload_file()`
	- Saved the project when sandboxes reset at 400k and 600k steps
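
A minimal sketch of the save-then-push logic, assuming the target is a model repo and that `api` behaves like `huggingface_hub.HfApi` (only its `upload_file` method is used). `push_checkpoint`, `checkpoint_due`, and `_RecordingApi` are hypothetical names; the recording class exists only so the example runs without network access, and the 650,352-step checkpoint is made up.

```python
import os


def checkpoint_due(step, last_saved_step, every=50_000):
    """True once at least `every` steps have passed since the last save."""
    return step - last_saved_step >= every


def push_checkpoint(api, local_path, repo_id, step):
    """Upload one checkpoint file to the Hub under its own filename."""
    api.upload_file(
        path_or_fileobj=local_path,
        path_in_repo=os.path.basename(local_path),
        repo_id=repo_id,
        repo_type="model",  # assumption: the main repo is a model repo
        commit_message=f"Checkpoint at {step} steps",
    )


class _RecordingApi:
    """Stand-in for HfApi that records calls instead of contacting the Hub."""

    def __init__(self):
        self.calls = []

    def upload_file(self, **kwargs):
        self.calls.append(kwargs)


api = _RecordingApi()
if checkpoint_due(step=650_352, last_saved_step=600_352):
    push_checkpoint(api, "/tmp/phase2_ckpt_650352.zip",
                    "E-Rong/til-26-ae-agent", 650_352)
print(api.calls[0]["path_in_repo"])
```

Pushing immediately after each local save is what makes a checkpoint survive a sandbox reset; a file that exists only on local disk is lost with the machine.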

	---

	## Technical Decisions That Failed

	| Decision | Why It Failed | Fix |
	|---|---|---|
	| Training in sandboxes | Process died, empty sandbox kept billing | Use HF Jobs |
	| `git clone` in HF Job | No auth for private repo | `snapshot_download` |
	| Inline 20KB script in `hf_jobs.script` | Delivery mechanism choked | Write to sandbox file first, submit path |
	| No session state on Hub | Lost track of progress across resets | `session_state.json` + this file |
	| `Monitor` inside `ActionMasker` | `get_action_masks()` failed | `ActionMasker` → `Monitor` order |

	---

	## Cost Awareness

	| Hardware | $/hr | Good For |
	|---|---|---|
	| `cpu-basic` | ~$0.05 | Writing scripts, reading files, small tests |
	| `t4-small` | ~$0.40 | Short dev, NOT training |
	| `a10g-small` | ~$1.00 | Training, but use HF Jobs not sandboxes |
	| `a10g-large` | ~$2.00 | Larger batch sizes, not needed for this project |

	Rule: If a task takes >30 min, it must be an HF Job. Sandboxes are for editing and quick tests only.

	---

	## Sandbox Policy (User Mandate)

	> From this point forward, the user has mandated:

	1. Start `cpu-basic` sandbox at the beginning of every session
	2. Use `cpu-basic` for: context, writing code, writing docs, editing files, planning
	3. Only switch to GPU sandbox (`t4-small` or `a10g-small`) when performing smoke tests for training scripts
	4. Stop GPU sandbox IMMEDIATELY after the smoke test completes
	5. Training tasks ONLY as HF Jobs — never leave a training process running in a sandbox
	6. Never leave a GPU sandbox running idle — this wastes money

	Why this matters: A GPU sandbox at $1/hr running empty for 3 hours = $3 wasted for nothing. An HF Job at the same $1/hr actually trains for every billed minute.
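
The arithmetic behind that example, as a small sketch. The rates are the approximate figures from the Cost Awareness table, and `idle_cost` is a hypothetical helper.

```python
# Approximate hourly rates from the Cost Awareness table (USD per hour).
HOURLY_RATE = {
    "cpu-basic": 0.05,
    "t4-small": 0.40,
    "a10g-small": 1.00,
    "a10g-large": 2.00,
}


def idle_cost(hardware, idle_hours):
    """Money spent on a machine that is billed but doing no useful work."""
    return HOURLY_RATE[hardware] * idle_hours


print(f"${idle_cost('a10g-small', 3):.2f}")  # prints $3.00
```

The same three idle hours on `cpu-basic` cost about $0.15, which is why the CPU sandbox is the default for everything except smoke tests.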

	---

	## Repo File Guide

	| File | What It Is |
	|---|---|
	| `session_state.json` | Current phase, checkpoint, mistakes log, next steps |
	| `AGENTS.md` | This file: lessons and context |
	| `docs/ae.md` | Full project documentation (research, design, results) |
	| `phase1_final.zip` | Complete Phase 1 model |
	| `phase2_ckpt_*.zip` | Phase 2 intermediate checkpoints |
	| `phase2_final.zip` | Phase 2 complete model (when done) |
	| `ae_manager.py` | Inference code for the evaluation server |
	| `phase2_job.py` | Latest HF Job script (may need fixes) |
	| `train_all_phases.py` | Original training script |
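
As an illustration of what `session_state.json` might hold, the sketch below writes and reads a state file locally. The field names (`phase`, `checkpoint`, `mistakes`, `next_steps`) are assumptions derived from the description in the table, not the file's confirmed schema, and `save_state` / `load_state` are hypothetical helpers.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Write the state atomically so a killed process cannot leave half a file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle, indent=2)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_state(path):
    """Read the state back as a plain dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


state = {
    "phase": 2,
    "checkpoint": "phase2_ckpt_600352.zip",
    "mistakes": ["trained in a sandbox; process died silently"],
    "next_steps": ["smoke-test HF Job", "resume Phase 2"],
}
state_path = os.path.join(tempfile.mkdtemp(), "session_state.json")
save_state(state_path, state)
restored = load_state(state_path)
```

The local file still has to be uploaded to the Hub after each update; on its own it disappears with the sandbox.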

	---

	## Current State (Last Updated: 2026-05-14)

	- Phase: 2 (in progress, crashed)
	- Latest checkpoint: `phase2_ckpt_600352.zip`
	- Target: 1,000,352 total steps (400,000 remaining)
	- Blocker: Need to resume via HF Job with `snapshot_download` auth
	- Smoke test needed: YES — before any multi-hour job
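
The remaining-step figure above follows directly from the checkpoint name and the target. A small sketch, where `remaining_steps` is a hypothetical helper and the filename pattern is assumed from `phase2_ckpt_600352.zip`:

```python
import re


def remaining_steps(checkpoint_name, target_total):
    """Steps left, parsing the step count from a name like phase2_ckpt_600352.zip."""
    match = re.search(r"_ckpt_(\d+)\.zip$", checkpoint_name)
    if match is None:
        raise ValueError(f"no step count in {checkpoint_name!r}")
    done = int(match.group(1))
    return max(target_total - done, 0)


print(remaining_steps("phase2_ckpt_600352.zip", 1_000_352))  # prints 400000
```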

	---

	## Contact / Ownership

	- User: E-Rong
	- Org: E-Rong
	- Billing: Must specify `namespace="E-Rong"` on all HF Jobs
	- I am: An ephemeral agent with no persistent memory. I depend on this file.

	---

	If you are a future version of me reading this: you have no memory of the above. Do not trust your internal knowledge. Read the files, check the state, test before committing compute.