	# AGENTS.md — Context & Lessons for Future Sessions

	> This file exists because sandboxes reset and I (the agent) lose all memory.
	> READ THIS FIRST before doing anything on this project.

	---

	## What This Project Is

	- Challenge: TIL-26-AE (The Intelligent League — Automated Exploration)
	- Game: Multi-agent Bomberman on a 16×16 grid
	- My Role: Train `agent_0` via RL to compete autonomously
	- Main Repo: `E-Rong/til-26-ae-agent` (models, checkpoints, scripts, docs)
	- Space: `e-rong/til-26-ae` (evaluation server with `ae/src/ae_manager.py`)
	- TIL Source: Private Space `e-rong/til-26-ae` — contains `til_environment/` module

	---

	## CRITICAL: What Killed Training & Cost Money

	### ❌ NEVER USE SANDBOXES FOR TRAINING > 30 MINUTES

	Sandboxes are interactive dev environments. They:
	- Recycle after inactivity / timeout
	- Kill processes silently
	- Keep billing you while empty after the process dies

	Damage done: ~$4.87 wasted across 4 sandbox sessions where training died but billing continued.

	### ✅ ALWAYS USE HF JOBS FOR BATCH TRAINING

	- Persistent GPU allocation
	- Runs until completion (or your timeout)
	- Fails visibly if something breaks (no silent empty billing)
	- Requirement: set `namespace="E-Rong"` so the job bills the org, not the user

	### ❌ NEVER `git clone` A PRIVATE REPO IN AN HF JOB

	`git clone https://huggingface.co/spaces/...` fails because git does not read `HF_TOKEN`.

	Use instead:
	```python
	from huggingface_hub import snapshot_download

	snapshot_download(
	    repo_id='e-rong/til-26-ae',
	    repo_type='space',
	    local_dir='/app/til-26-ae-repo',
	)
	```
	`snapshot_download` auto-uses the `HF_TOKEN` env var.

	### ✅ ALWAYS SMOKE-TEST A JOB BEFORE THE FULL RUN

	Submit a 5-minute job that:
	1. Downloads the TIL repo
	2. Installs deps
	3. Runs 100 training steps
	4. Saves a dummy checkpoint to the Hub

	Only after this succeeds, submit the multi-hour job.

	---

	## Session Startup Checklist

	Before doing anything on this project:

	1. [ ] Read `session_state.json` from `E-Rong/til-26-ae-agent`
	2. [ ] Read this file (`AGENTS.md`)
	3. [ ] Check the latest checkpoint on the Hub (sort the `phase*_ckpt_*.zip` files by step count)
	4. [ ] Determine current phase and remaining steps
	5. [ ] If training is needed: write the script to the sandbox, upload it to the Hub repo, then smoke-test it as an HF Job first
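
Step 3 of the checklist can be sketched as below. This is a minimal sketch, assuming checkpoints follow the `phase<N>_ckpt_<steps>.zip` naming seen in this repo (for example `phase2_ckpt_600352.zip`); the helper name `latest_checkpoint` is hypothetical. Sorting on the parsed integer matters because a plain string sort would rank `99999` above `600352`.

```python
import re

# Assumed naming scheme, based on names such as phase2_ckpt_600352.zip.
CKPT_PATTERN = re.compile(r"^phase(\d+)_ckpt_(\d+)\.zip$")


def latest_checkpoint(filenames):
    """Return (phase, steps, filename) for the highest-step checkpoint, or None."""
    found = []
    for name in filenames:
        match = CKPT_PATTERN.match(name)
        if match:
            found.append((int(match.group(1)), int(match.group(2)), name))
    # Compare on the numeric step count first, not on the raw string.
    return max(found, key=lambda item: (item[1], item[0]), default=None)


files = ["AGENTS.md", "phase1_final.zip", "phase2_ckpt_550352.zip", "phase2_ckpt_600352.zip"]
print(latest_checkpoint(files))
```

In a real session the file list would likely come from `HfApi().list_repo_files("E-Rong/til-26-ae-agent")` rather than a hard-coded list.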

	---

	## Technical Decisions That Work

	### MaskablePPO + Action Masking
	- `sb3_contrib.MaskablePPO` with `ActionMasker`
	- Bomberman has `action_mask: uint8[6]` — walls/edges make moves illegal
	- Standard PPO wastes ~30-40% of samples on illegal actions early on
	- Paper: Huang & Ontañón, "A Closer Look at Invalid Action Masking in Policy Gradient Algorithms" (arXiv:2006.14171)
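
To make the masking idea concrete, here is a small NumPy illustration of what masking does to a policy's action distribution. It is a sketch of the general technique (illegal logits pushed to negative infinity before the softmax), not the internals of `sb3_contrib`; the function name `masked_action_probs` is hypothetical.

```python
import numpy as np


def masked_action_probs(logits, action_mask):
    """Softmax over logits with illegal actions (mask == 0) forced to probability 0."""
    logits = np.asarray(logits, dtype=np.float64)
    legal = np.asarray(action_mask, dtype=bool)
    if not legal.any():
        raise ValueError("action_mask must allow at least one action")
    masked = np.where(legal, logits, -np.inf)
    # Subtract the max for numerical stability; exp(-inf) evaluates to 0.
    exp = np.exp(masked - masked.max())
    return exp / exp.sum()


# Six actions, two of them blocked (for example by a wall).
probs = masked_action_probs(np.zeros(6), [1, 1, 0, 0, 1, 1])
print(probs)
```

With uniform logits, the four legal actions share the probability mass equally and the two illegal ones are never sampled.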

	### Observation Flattening
	1511-dim vector from dict observation:
	```
	agent_viewcone: 7×5×25 = 875
	base_viewcone: 5×5×25 = 625
	direction, location[2], base_location[2], health, frozen_ticks,
	base_health, team_resources, team_bombs, step = 11 scalars
	Total: 1511
	```
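
The layout above can be sketched as a flattening function. This is a sketch under assumptions: the dictionary keys are taken from the listing, but the concatenation order, the `float32` dtype, and the function name `flatten_observation` are illustrative and may differ from the real training script.

```python
import numpy as np

# Order follows the listing above; the real environment's key names,
# dtypes, and ordering are assumptions here.
SCALAR_KEYS = (
    "direction", "location", "base_location", "health", "frozen_ticks",
    "base_health", "team_resources", "team_bombs", "step",
)


def flatten_observation(obs):
    """Concatenate the dict observation into a single 1511-dim float32 vector."""
    parts = [
        np.asarray(obs["agent_viewcone"], dtype=np.float32).reshape(-1),  # 7*5*25 = 875
        np.asarray(obs["base_viewcone"], dtype=np.float32).reshape(-1),   # 5*5*25 = 625
    ]
    for key in SCALAR_KEYS:  # 11 values in total (two entries are 2-vectors)
        parts.append(np.asarray(obs[key], dtype=np.float32).reshape(-1))
    flat = np.concatenate(parts)
    if flat.shape != (1511,):
        raise ValueError(f"unexpected observation size: {flat.shape}")
    return flat


dummy = {
    "agent_viewcone": np.zeros((7, 5, 25)),
    "base_viewcone": np.zeros((5, 5, 25)),
    "location": np.zeros(2),
    "base_location": np.zeros(2),
    **{k: 0 for k in ("direction", "health", "frozen_ticks",
                      "base_health", "team_resources", "team_bombs", "step")},
}
print(flatten_observation(dummy).shape)
```

The explicit size check is there so that a change in the environment's observation space fails loudly instead of silently feeding a wrong-sized vector to the policy.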

	### Wrapper Order (CRITICAL)
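	```python
	from sb3_contrib.common.wrappers import ActionMasker
	from stable_baselines3.common.monitor import Monitor

	# CORRECT: mask the base env first, then wrap with Monitor
	env = ActionMasker(base_env, lambda e: e.action_masks())
	env = Monitor(env)

	# WRONG: Monitor blocks action_masks() exposure, so DON'T DO THIS
	# env = ActionMasker(Monitor(base_env), ...)
	```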

	### 3-Phase Curriculum
	| Phase | Opponent | Duration (steps) | Purpose |
	|---|---|---|---|
	| 1 | Random | 500k | Learn basics |
	| 2 | Random + visit-count shaping | 500k | Prevent camping |
	| 3 | Rule-based curriculum | 1M | Generalize to structured opponents |

	### Checkpointing Every 50k Steps
	- Local + Hub push via `HfApi.upload_file()`
	- Saved the project when sandboxes reset at 400k and 600k steps
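
A minimal sketch of the checkpoint trigger and Hub push follows. Step counts such as 600352 are not exact multiples of 50,000, so the sketch checks whether a multiple was crossed rather than testing for equality. The helper names are hypothetical; `HfApi.upload_file` is the call named above, and it is assumed that `HF_TOKEN` grants write access to the repo.

```python
def crossed_checkpoint(prev_steps, curr_steps, interval=50_000):
    """True when training passed a multiple of `interval` between two step counts."""
    return curr_steps // interval > prev_steps // interval


def push_checkpoint(local_path, repo_id, path_in_repo):
    """Upload one checkpoint file to the Hub (assumes HF_TOKEN has write access)."""
    # Imported lazily so crossed_checkpoint() has no third-party dependency.
    from huggingface_hub import HfApi

    HfApi().upload_file(
        path_or_fileobj=local_path,
        path_in_repo=path_in_repo,
        repo_id=repo_id,
    )


print(crossed_checkpoint(598_304, 600_352))  # a 50k boundary lies between these counts
print(crossed_checkpoint(600_352, 602_400))  # no boundary between these counts
```

Saving locally first and pushing immediately afterwards keeps the Hub copy at most one interval behind the running process.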

	---

	## Technical Decisions That Failed

	| Decision | Why It Failed | Fix |
	|---|---|---|
	| Training in sandboxes | Process died, empty sandbox kept billing | Use HF Jobs |
	| `git clone` in HF Job | No auth for private repo | `snapshot_download` |
	| Inline 20KB script in `hf_jobs.script` | Delivery mechanism choked | Write to a sandbox file, upload it to the Hub repo, then submit the path |
	| No session state on Hub | Lost track of progress across resets | `session_state.json` + this file |
	| `Monitor` inside `ActionMasker` | `get_action_masks()` failed | `ActionMasker` → `Monitor` order |

	---

	## Cost Awareness

	| Hardware | $/hr | Good For |
	|---|---|---|
	| `cpu-basic` | ~$0.05 | Writing scripts, reading files, small tests |
	| `t4-small` | ~$0.40 | Short dev, NOT training |
	| `a10g-small` | ~$1.00 | Training, but use HF Jobs not sandboxes |
	| `a10g-large` | ~$2.00 | Larger batch sizes, not needed for this project |

	Rule: If a task takes >30 min, it must be an HF Job. Sandboxes are for editing and quick tests only.

	---

	## Sandbox Policy (User Mandate)

	> From this point forward, the user has mandated:

	1. Start `cpu-basic` sandbox at the beginning of every session
	2. Use `cpu-basic` for: context, writing code, writing docs, editing files, planning
	3. Only switch to GPU sandbox (`t4-small` or `a10g-small`) when performing smoke tests for training scripts
	4. Stop GPU sandbox IMMEDIATELY after the smoke test completes
	5. Training tasks ONLY as HF Jobs — never leave a training process running in a sandbox
	6. Never leave a GPU sandbox running idle — this wastes money

	Why this matters: A GPU sandbox at $1/hr running empty for 3 hours = $3 wasted for nothing. An HF Job at the same $1/hr actually trains for every billed minute.
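
The arithmetic behind that example can be written as a tiny helper. The rates are the approximate figures from the Cost Awareness table, not authoritative pricing, and the function name `idle_cost` is hypothetical.

```python
# Approximate hourly rates (USD) copied from the Cost Awareness table.
HOURLY_RATE_USD = {
    "cpu-basic": 0.05,
    "t4-small": 0.40,
    "a10g-small": 1.00,
    "a10g-large": 2.00,
}


def idle_cost(flavor, idle_hours):
    """Money spent on a sandbox that is billed while doing no useful work."""
    return round(HOURLY_RATE_USD[flavor] * idle_hours, 2)


print(idle_cost("a10g-small", 3))  # 3.0
print(idle_cost("t4-small", 0.5))  # 0.2
```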

	---

	## How to Submit HF Jobs Correctly (Research Results)

	### Based on `huggingface.co/docs/hub/jobs-quickstart`:

	DO NOT use `git clone` for private repos.

	```python
	# WRONG ❌
	import subprocess
	subprocess.run(["git", "clone", "https://huggingface.co/spaces/e-rong/til-26-ae"])
	# Fails: git does not read HF_TOKEN env var

	# CORRECT ✅
	from huggingface_hub import snapshot_download
	snapshot_download(
	    repo_id="e-rong/til-26-ae",
	    repo_type="space",
	    local_dir="/app/til-26-ae-repo",
	)
	# snapshot_download auto-uses HF_TOKEN from environment
	```

	### Script Submission Pattern (What Actually Works)

	⚠️ CRITICAL DISCOVERY: The `script` parameter in `hf_jobs` becomes a RAW HUB URL.

	When you call `hf_jobs(script="/app/train.py")`, the job system does NOT upload the local file. Instead, it converts the path to:
	```
	https://huggingface.co/E-Rong/til-26-ae-agent/raw/main/train.py
	```
	and runs it via `uv run <url>`. This means the file MUST already exist on the Hub repo.
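
The observed mapping can be sketched as a helper that predicts which URL a job will fetch. This is inferred from the behaviour described here, not taken from the job system's source: it assumes only the file name of the sandbox path is kept and that the repo and `main` revision are fixed. The function name `expected_raw_url` is hypothetical.

```python
from pathlib import PurePosixPath


def expected_raw_url(script_path, repo_id, revision="main"):
    """Predict the raw Hub URL a job would fetch for a given sandbox script path.

    Assumption: only the file name survives the conversion.
    """
    filename = PurePosixPath(script_path).name
    return f"https://huggingface.co/{repo_id}/raw/{revision}/{filename}"


print(expected_raw_url("/app/train.py", "E-Rong/til-26-ae-agent"))
```

Printing the predicted URL before submitting makes it easy to compare against what `hf_jobs inspect` later reports.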

	The correct workflow is:

	```python
	from pathlib import Path

	from tools import write, hf_repo_files, hf_jobs

	# Step 1: Write script to sandbox file
	write(path="/app/train.py", content="...")

	# Step 2: ALSO upload to Hub repo so it's persisted and URL-accessible
	hf_repo_files(
	    operation="upload",
	    repo_id="E-Rong/til-26-ae-agent",
	    path="train.py",
	    content=Path("/app/train.py").read_text(),
	)

	# Step 3: Submit job referencing the sandbox path
	# The job system will convert this to a Hub raw URL under the hood
	hf_jobs(
	    operation="run",
	    script="/app/train.py",  # ← sandbox file path
	    dependencies=[
	        "torch", "sb3-contrib", "gymnasium", "pettingzoo",
	        "numpy", "huggingface_hub", "pygame", "omegaconf",
	        "mazelib", "imageio", "imageio-ffmpeg", "supersuit", "psutil",
	    ],
	    hardware_flavor="a10g-small",
	    timeout="6h",
	    namespace="E-Rong",  # ← bills to org
	)
	```

	Verification from `hf_jobs inspect`:
	```bash
	exec uv run --with torch --with sb3-contrib ... \
	https://huggingface.co/E-Rong/til-26-ae-agent/raw/main/phase2_resume.py
	```
	The job fetches the script from the Hub, not from the sandbox. The sandbox path is just used to derive the repo/file path.

	Why this matters: If you only write to `/app/train.py` and don't upload to the Hub, the job will fail with a 404 when it tries to fetch the URL. The sandbox resets, but the Hub URL is permanent.
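
A pre-flight check along these lines could catch the 404 before any compute is billed. It is a sketch, assuming `HfApi.file_exists` from `huggingface_hub` is available in the sandbox and that `HF_TOKEN` can read the repo; the function name `script_is_on_hub` and the injectable `exists_fn` parameter are hypothetical and exist only so the check can be exercised offline.

```python
def script_is_on_hub(repo_id, filename, exists_fn=None):
    """Return True if `filename` already exists in the Hub repo `repo_id`."""
    if exists_fn is None:
        # Imported lazily so the function can be tested without network access.
        from huggingface_hub import HfApi

        exists_fn = HfApi().file_exists
    return bool(exists_fn(repo_id, filename))


# Offline demonstration with stand-in checkers instead of a real Hub call.
print(script_is_on_hub("E-Rong/til-26-ae-agent", "train.py", exists_fn=lambda repo, name: True))
print(script_is_on_hub("E-Rong/til-26-ae-agent", "train.py", exists_fn=lambda repo, name: False))
```

Calling it without `exists_fn` right before `hf_jobs(operation="run", ...)` would turn a failed job into an immediate, free error.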

	### Job Persistence
	- Jobs run on HF infrastructure, not in your sandbox
	- The sandbox can die — the job keeps running
	- Check logs with `hf_jobs(operation="logs", job_id="...")`
	- Job storage is ephemeral — push checkpoints to Hub (not just local)

	---

	## Repo File Guide

	| File | What It Is |
	|---|---|
	| `session_state.json` | Current phase, checkpoint, mistakes log, next steps |
	| `AGENTS.md` | This file: lessons and context |
	| `docs/ae.md` | Full project documentation (research, design, results) |
	| `phase1_final.zip` | Complete Phase 1 model |
	| `phase2_ckpt_*.zip` | Phase 2 intermediate checkpoints |
	| `phase2_final.zip` | Phase 2 complete model (when done) |
	| `ae_manager.py` | Inference code for the evaluation server |
	| `phase2_resume.py` | Latest HF Job script (works; uses `snapshot_download`) |
	| `smoke_test.py` | 5-minute validation job; run it before any real job |
	| `train_all_phases.py` | Original training script |

	---

	## Current State (Last Updated: 2026-05-14)

	- Phase: 2 (in progress, crashed)
	- Latest checkpoint: `phase2_ckpt_600352.zip`
	- Target: 1,000,352 total steps (400,000 remaining)
	- Blocker: Need to resume via HF Job with `snapshot_download` auth
	- Smoke test needed: YES — before any multi-hour job
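
The remaining-step figure above is a simple difference, sketched here so a resume script can recompute it from the checkpoint rather than hard-coding it. The function name `remaining_steps` is hypothetical.

```python
def remaining_steps(target_total, checkpoint_steps):
    """Steps still to train, given the target total and the checkpoint's step count."""
    if checkpoint_steps > target_total:
        raise ValueError("checkpoint is already past the target")
    return target_total - checkpoint_steps


print(remaining_steps(1_000_352, 600_352))  # 400000
```

Assuming the resume script continues the loaded model's step counter, this value would be what gets passed as `total_timesteps` to `model.learn(..., reset_num_timesteps=False)`.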

	---

	## Contact / Ownership

	- User: E-Rong
	- Org: E-Rong
	- Billing: Must specify `namespace="E-Rong"` on all HF Jobs
	- I am: An ephemeral agent with no persistent memory. I depend on this file.

	---

	If you are a future version of me reading this: you have no memory of the above. Do not trust your internal knowledge. Read the files, check the state, test before committing compute.