| # AGENTS.md: Context & Lessons for Future Sessions |
|
|
| > This file exists because sandboxes reset and I (the agent) lose all memory. |
| > **READ THIS FIRST** before doing anything on this project. |
|
|
| --- |
|
|
| ## What This Project Is |
|
|
| - **Challenge**: TIL-26-AE (The Intelligent League: Automated Exploration) |
| - **Game**: Multi-agent Bomberman on a 16×16 grid |
| - **My Role**: Train `agent_0` via RL to compete autonomously |
| - **Main Repo**: `E-Rong/til-26-ae-agent` (models, checkpoints, scripts, docs) |
| - **Space**: `e-rong/til-26-ae` (evaluation server with `ae/src/ae_manager.py`) |
| - **TIL Source**: Private Space `e-rong/til-26-ae`, which contains the `til_environment/` module |
|
|
| --- |
|
|
| ## CRITICAL: What Killed Training & Cost Money |
|
|
| ### ❌ NEVER USE SANDBOXES FOR TRAINING > 30 MINUTES |
|
|
| Sandboxes are **interactive dev environments**. They: |
| - Recycle after inactivity / timeout |
| - Kill processes silently |
| - **Keep billing you while empty** after the process dies |
|
|
| **Damage done**: ~$4.87 wasted across 4 sandbox sessions where training died but billing continued. |
|
|
| ### ✅ ALWAYS USE HF JOBS FOR BATCH TRAINING |
|
|
| - Persistent GPU allocation |
| - Runs until completion (or your timeout) |
| - Fails visibly if something breaks (no silent empty billing) |
| - Must set `namespace="E-Rong"` to bill the org, not the user |
|
|
| ### ❌ NEVER `git clone` A PRIVATE REPO IN AN HF JOB |
|
|
| `git clone https://huggingface.co/spaces/...` fails because git does not read `HF_TOKEN`. |
|
|
| **Use instead**: |
| ```python |
| from huggingface_hub import snapshot_download |
| snapshot_download( |
|     repo_id='e-rong/til-26-ae', |
|     repo_type='space', |
|     local_dir='/app/til-26-ae-repo' |
| ) |
| ``` |
| `snapshot_download` auto-uses the `HF_TOKEN` env var. |
|
|
| ### ✅ ALWAYS SMOKE-TEST A JOB BEFORE THE FULL RUN |
|
|
| Submit a 5-minute job that: |
| 1. Downloads the TIL repo |
| 2. Installs deps |
| 3. Runs 100 training steps |
| 4. Saves a dummy checkpoint to the Hub |
|
|
| Only after this succeeds, submit the multi-hour job. |
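
The four smoke-test steps can be run as a fail-fast sequence. The following is a minimal sketch, not the project's actual `smoke_test.py`: `run_smoke_test` and the stage callables are hypothetical placeholders, and the 300-second budget simply mirrors the 5-minute target above.

```python
import time


def run_smoke_test(stages, budget_seconds=300):
    """Run (name, callable) stages in order; stop at the first failure or budget overrun."""
    start = time.monotonic()
    for name, stage in stages:
        try:
            stage()
        except Exception as exc:  # any failure should abort before the real run
            return False, f"{name} failed: {exc}"
        if time.monotonic() - start > budget_seconds:
            return False, f"{name} exceeded the {budget_seconds}s budget"
    return True, "all stages passed"


# Hypothetical stage order matching the list above:
# stages = [("download", download_til_repo), ("install", install_deps),
#           ("train_100", train_100_steps), ("push_ckpt", push_dummy_checkpoint)]
```

Returning a `(passed, message)` pair keeps the result easy to log from the job before deciding whether to submit the multi-hour run.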
|
|
| --- |
|
|
| ## Session Startup Checklist |
|
|
| Before doing **anything** on this project: |
|
|
| 1. [ ] Read `session_state.json` from `E-Rong/til-26-ae-agent` |
| 2. [ ] Read this file (`AGENTS.md`) |
| 3. [ ] Check latest checkpoint on Hub (sort `phase*_ckpt_*.zip` files) |
| 4. [ ] Determine current phase and remaining steps |
| 5. [ ] If training needed: write script to sandbox, **smoke-test in HF Job first** |
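
Steps 3 and 4 can be sketched as a small helper. This is a minimal sketch, assuming checkpoint names follow the `phase<N>_ckpt_<steps>.zip` pattern used in this repo; `latest_checkpoint` and `remaining_steps` are hypothetical names, and the file list is assumed to come from something like `HfApi().list_repo_files("E-Rong/til-26-ae-agent")`.

```python
import re

# Assumed naming pattern: phase<N>_ckpt_<total_steps>.zip
CKPT_RE = re.compile(r"^phase(\d+)_ckpt_(\d+)\.zip$")


def latest_checkpoint(filenames):
    """Return (phase, steps, filename) for the latest phase and highest step count, or None."""
    found = []
    for name in filenames:
        match = CKPT_RE.match(name)
        if match:
            found.append((int(match.group(1)), int(match.group(2)), name))
    # Compare numerically: a plain string sort would rank "99999" above "600352".
    return max(found, key=lambda item: (item[0], item[1]), default=None)


def remaining_steps(current_steps, target_steps):
    """Steps still to train before reaching the target (never negative)."""
    return max(target_steps - current_steps, 0)
```

Files that do not match the pattern (for example `phase1_final.zip` or `AGENTS.md`) are ignored, so the full repo listing can be passed in unfiltered.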
|
|
| --- |
|
|
| ## Technical Decisions That Work |
|
|
| ### MaskablePPO + Action Masking |
| - `sb3_contrib.MaskablePPO` with `ActionMasker` |
| - Bomberman has `action_mask: uint8[6]`; walls/edges make moves illegal |
| - Standard PPO wastes ~30-40% of samples on illegal actions early on |
| - **Papers**: Huang & Ontañón, "A Closer Look at Invalid Action Masking in Policy Gradient Algorithms" (arxiv:2006.14171) |
|
|
| ### Observation Flattening |
| 1511-dim vector from dict observation: |
| ``` |
| agent_viewcone: 7×5×25 = 875 |
| base_viewcone: 5×5×25 = 625 |
| direction, location[2], base_location[2], health, frozen_ticks, |
| base_health, team_resources, team_bombs, step = 11 scalars |
| Total: 1511 |
| ``` |
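
The total can be re-derived from the component shapes as a sanity check before building the flattening wrapper. This is a minimal sketch, assuming the keys and shapes are exactly those listed above; `OBS_SHAPES` and `flat_dim` are illustrative names, not part of the TIL environment.

```python
from math import prod

# Component shapes as listed above; () marks a single scalar.
OBS_SHAPES = {
    "agent_viewcone": (7, 5, 25),
    "base_viewcone": (5, 5, 25),
    "direction": (),
    "location": (2,),
    "base_location": (2,),
    "health": (),
    "frozen_ticks": (),
    "base_health": (),
    "team_resources": (),
    "team_bombs": (),
    "step": (),
}


def flat_dim(shapes):
    """Length of the flattened vector: sum of element counts over all components."""
    # prod(()) is 1, so scalars contribute one element each.
    return sum(prod(shape) for shape in shapes.values())
```

Comparing `flat_dim(OBS_SHAPES)` against the policy's input size is a cheap check to run if the environment's observation spec ever changes.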
|
|
| ### Wrapper Order (CRITICAL) |
| ```python |
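| from sb3_contrib.common.wrappers import ActionMasker |
| from stable_baselines3.common.monitor import Monitor |
| |
| # CORRECT: ActionMasker wraps the base env, Monitor goes outermost |
| env = ActionMasker(base_env, lambda e: e.action_masks()) |
| env = Monitor(env) |
| |
| # WRONG: Monitor blocks action_masks() exposure |
| env = ActionMasker(Monitor(base_env), ...)  # DON'T DO THIS |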
| ``` |
|
|
| ### 3-Phase Curriculum |
| | Phase | Opponent | Duration | Purpose | |
| |---|---|---|---| |
| | 1 | Random | 500k | Learn basics | |
| | 2 | Random + visit-count shaping | 500k | Prevent camping | |
| | 3 | Rule-based curriculum | 1M | Generalize to structured opponents | |
|
|
| ### Checkpointing Every 50k Steps |
| - Local + Hub push via `HfApi.upload_file()` |
| - Saved the project when sandboxes reset at 400k and 600k steps |
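
The checkpoint-and-push step might look like the sketch below. It assumes `E-Rong/til-26-ae-agent` is a model repo and that `HF_TOKEN` is set in the environment; `crossed_checkpoint` and `push_checkpoint` are hypothetical helper names, not the code in the training scripts.

```python
import os


def crossed_checkpoint(prev_steps, cur_steps, every=50_000):
    """True if training passed a multiple of `every` between two step counts."""
    return cur_steps // every > prev_steps // every


def push_checkpoint(local_path, repo_id="E-Rong/til-26-ae-agent"):
    """Upload one checkpoint file to the Hub under its own file name."""
    # Imported lazily so crossed_checkpoint works without huggingface_hub installed.
    from huggingface_hub import HfApi

    HfApi().upload_file(
        path_or_fileobj=local_path,
        path_in_repo=os.path.basename(local_path),
        repo_id=repo_id,
        repo_type="model",  # assumption: the main repo is a model repo
    )
```

Testing for a crossed boundary, instead of `steps % 50_000 == 0`, still fires when the step counter advances in rollout-sized jumps and never lands exactly on a multiple of 50k.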
|
|
| --- |
|
|
| ## Technical Decisions That Failed |
|
|
| | Decision | Why It Failed | Fix | |
| |---|---|---| |
| | Training in sandboxes | Process died, empty sandbox kept billing | Use HF Jobs | |
| | `git clone` in HF Job | No auth for private repo | `snapshot_download` | |
| | Inline 20KB script in `hf_jobs.script` | Delivery mechanism choked | Write to a sandbox file, upload it to the Hub repo, then submit the path | |
| | No session state on Hub | Lost track of progress across resets | `session_state.json` + this file | |
| | `Monitor` inside `ActionMasker` | `get_action_masks()` failed | `ActionMasker` → `Monitor` order | |
|
|
| --- |
|
|
| ## Cost Awareness |
|
|
| | Hardware | $/hr | Good For | |
| |---|---|---| |
| | `cpu-basic` | ~$0.05 | Writing scripts, reading files, small tests | |
| | `t4-small` | ~$0.40 | Short dev, NOT training | |
| | `a10g-small` | ~$1.00 | Training, but use HF Jobs not sandboxes | |
| | `a10g-large` | ~$2.00 | Larger batch sizes, not needed for this project | |
|
|
| **Rule**: If a task takes >30 min, it must be an HF Job. Sandboxes are for editing and quick tests only. |
|
|
| --- |
|
|
| ## Sandbox Policy (User Mandate) |
|
|
| > **From this point forward, the user has mandated:** |
|
|
| 1. **Start `cpu-basic` sandbox** at the beginning of every session |
| 2. **Use `cpu-basic` for**: context, writing code, writing docs, editing files, planning |
| 3. **Only switch to GPU sandbox** (`t4-small` or `a10g-small`) when performing **smoke tests** for training scripts |
| 4. **Stop GPU sandbox IMMEDIATELY** after the smoke test completes |
| 5. **Training tasks ONLY as HF Jobs**: never leave a training process running in a sandbox |
| 6. **Never leave a GPU sandbox running idle**: this wastes money |
|
|
| **Why this matters**: A GPU sandbox at $1/hr running empty for 3 hours = $3 wasted for nothing. An HF Job at the same $1/hr actually trains for every billed minute. |
|
|
| --- |
|
|
| ## How to Submit HF Jobs Correctly (Research Results) |
|
|
| ### Based on `huggingface.co/docs/hub/jobs-quickstart`: |
|
|
| **DO NOT use `git clone` for private repos.** |
|
|
| ```python |
| # WRONG ❌ |
| import subprocess |
| subprocess.run(["git", "clone", "https://huggingface.co/spaces/e-rong/til-26-ae"]) |
| # Fails: git does not read HF_TOKEN env var |
| |
| # CORRECT ✅ |
| from huggingface_hub import snapshot_download |
| snapshot_download( |
|     repo_id="e-rong/til-26-ae", |
|     repo_type="space", |
|     local_dir="/app/til-26-ae-repo" |
| ) |
| # snapshot_download auto-uses HF_TOKEN from environment |
| ``` |
|
|
| ### Script Submission Pattern (What Actually Works) |
|
|
| **⚠️ CRITICAL DISCOVERY: The `script` parameter in `hf_jobs` becomes a RAW HUB URL.** |
| |
| When you call `hf_jobs(script="/app/train.py")`, the job system does NOT upload the local file. Instead, it converts the path to: |
| ``` |
| https://huggingface.co/E-Rong/til-26-ae-agent/raw/main/train.py |
| ``` |
| and runs it via `uv run <url>`. **This means the file MUST already exist on the Hub repo.** |
|
|
| **The correct workflow is:** |
|
|
| ```python |
| from tools import write, hf_repo_files, hf_jobs |
| |
| # Step 1: Write script to sandbox file |
| write(path="/app/train.py", content="...") |
| |
| # Step 2: ALSO upload to Hub repo so it's persisted and URL-accessible |
| hf_repo_files( |
|     operation="upload", |
|     repo_id="E-Rong/til-26-ae-agent", |
|     path="train.py", |
|     content=open("/app/train.py").read() |
| ) |
| |
| # Step 3: Submit job referencing the sandbox path |
| # The job system will convert this to a Hub raw URL under the hood |
| hf_jobs( |
|     operation="run", |
|     script="/app/train.py",  # ← sandbox file path |
|     dependencies=["torch", "sb3-contrib", "gymnasium", "pettingzoo", |
|                   "numpy", "huggingface_hub", "pygame", "omegaconf", |
|                   "mazelib", "imageio", "imageio-ffmpeg", "supersuit", "psutil"], |
|     hardware_flavor="a10g-small", |
|     timeout="6h", |
|     namespace="E-Rong"  # ← bills to org |
| ) |
| ``` |
|
|
| **Verification from `hf_jobs inspect`:** |
| ```bash |
| exec uv run --with torch --with sb3-contrib ... \ |
| https://huggingface.co/E-Rong/til-26-ae-agent/raw/main/phase2_resume.py |
| ``` |
| The job fetches the script from the Hub, not from the sandbox. The sandbox path is just used to derive the repo/file path. |
| |
| **Why this matters**: If you only write to `/app/train.py` and don't upload to the Hub, the job will fail with a 404 when it tries to fetch the URL. The sandbox resets, but the Hub URL is permanent. |
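
The path-to-URL mapping described above can be written down so the expected URL is known before submitting. This is a minimal sketch of the observed behaviour, not the job system's actual code: it assumes only the file name is kept and that the script is served from the `main` branch, and `hub_raw_url` is a hypothetical name.

```python
from pathlib import PurePosixPath


def hub_raw_url(sandbox_path, repo_id="E-Rong/til-26-ae-agent", revision="main"):
    """Build the raw Hub URL the job is expected to fetch for a sandbox script path."""
    # Assumption: the directory part of the sandbox path is dropped.
    filename = PurePosixPath(sandbox_path).name
    return f"https://huggingface.co/{repo_id}/raw/{revision}/{filename}"
```

Confirming that the same file name is present in the Hub repo before calling `hf_jobs` catches the 404 case without spending any job time.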
| |
| ### Job Persistence |
| - Jobs run on HF infrastructure, not in your sandbox |
| - The sandbox can die; the job keeps running |
| - Check logs with `hf_jobs(operation="logs", job_id="...")` |
| - Job storage is ephemeral, so **push checkpoints to Hub** (not just local) |
|
|
| --- |
|
|
| ## Repo File Guide |
|
|
| | File | What It Is | |
| |---|---| |
| | `session_state.json` | Current phase, checkpoint, mistakes log, next steps | |
| | `AGENTS.md` | This file: lessons and context | |
| | `docs/ae.md` | Full project documentation (research, design, results) | |
| | `phase1_final.zip` | Complete Phase 1 model | |
| | `phase2_ckpt_*.zip` | Phase 2 intermediate checkpoints | |
| | `phase2_final.zip` | Phase 2 complete model (when done) | |
| | `ae_manager.py` | Inference code for the evaluation server | |
| | `phase2_resume.py` | Latest HF Job script (works; uses `snapshot_download`) | |
| | `smoke_test.py` | 5-minute validation job; run it before any real job | |
| | `train_all_phases.py` | Original training script | |
|
|
| --- |
|
|
| ## Current State (Last Updated: 2026-05-14) |
|
|
| - **Phase**: 2 (in progress, crashed) |
| - **Latest checkpoint**: `phase2_ckpt_600352.zip` |
| - **Target**: 1,000,352 total steps (400,000 remaining) |
| - **Blocker**: Need to resume via HF Job with `snapshot_download` auth |
| - **Smoke test needed**: YES, before any multi-hour job |
|
|
| --- |
|
|
| ## Contact / Ownership |
|
|
| - **User**: E-Rong |
| - **Org**: E-Rong |
| - **Billing**: Must specify `namespace="E-Rong"` on all HF Jobs |
| - **I am**: An ephemeral agent with no persistent memory. I depend on this file. |
|
|
| --- |
|
|
| *If you are a future version of me reading this: you have no memory of the above. Do not trust your internal knowledge. Read the files, check the state, test before committing compute.* |
|
|