| # AGENTS.md — Context & Lessons for Future Sessions |
|
|
| > This file exists because sandboxes reset and I (the agent) lose all memory. |
| > **READ THIS FIRST** before doing anything on this project. |
|
|
| --- |
|
|
| ## What This Project Is |
|
|
| - **Challenge**: TIL-26-AE (The Intelligent League — Automated Exploration) |
| - **Game**: Multi-agent Bomberman on a 16×16 grid |
| - **My Role**: Train `agent_0` via RL to compete autonomously |
| - **Main Repo**: `E-Rong/til-26-ae-agent` (models, checkpoints, scripts, docs) |
| - **Space**: `e-rong/til-26-ae` (evaluation server with `ae/src/ae_manager.py`) |
| - **TIL Source**: Private Space `e-rong/til-26-ae` — contains `til_environment/` module |
|
|
| --- |
|
|
| ## CRITICAL: What Killed Training & Cost Money |
|
|
| ### ❌ NEVER USE SANDBOXES FOR TRAINING > 30 MINUTES |
|
|
| Sandboxes are **interactive dev environments**. They: |
| - Recycle after inactivity / timeout |
| - Kill processes silently |
| - **Keep billing you while empty** after the process dies |
|
|
| **Damage done**: ~$4.87 wasted across 4 sandbox sessions where training died but billing continued. |
|
|
| ### ✅ ALWAYS USE HF JOBS FOR BATCH TRAINING |
|
|
| - Persistent GPU allocation |
| - Runs until completion (or your timeout) |
| - Fails visibly if something breaks (no silent empty billing) |
| - Must set `namespace="E-Rong"` to bill the org, not the user |
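
A minimal sketch of a job submission that always bills the org. It assumes a recent `huggingface_hub` release that exposes `run_job`; the Docker image, command, and timeout are placeholder assumptions, not values recorded for this project.

```python
import os

ORG_NAMESPACE = "E-Rong"  # bill the org, not the user


def build_job_spec(command, flavor="a10g-small", timeout="5m"):
    """Collect run_job keyword arguments, always pinning the org namespace."""
    spec = {
        "image": "python:3.12",  # assumption: any image with Python would do
        "command": list(command),
        "flavor": flavor,
        "timeout": timeout,
        "namespace": ORG_NAMESPACE,
    }
    token = os.environ.get("HF_TOKEN")
    if token:
        # Forward the token so snapshot_download can authenticate in the job.
        spec["secrets"] = {"HF_TOKEN": token}
    return spec


def submit(spec):
    """Submit the job; imported lazily so the spec can be built offline."""
    from huggingface_hub import run_job

    return run_job(**spec)


spec = build_job_spec(["python", "-c", "print('smoke test')"])
```

Building the spec in one helper makes it hard to forget the namespace; `submit(spec)` is the only call that touches the network.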
|
|
| ### ❌ NEVER `git clone` A PRIVATE REPO IN AN HF JOB |
|
|
| `git clone https://huggingface.co/spaces/...` fails for private repos because git does not read the `HF_TOKEN` environment variable, so the clone is unauthenticated. |
|
|
| **Use instead**: |
| ```python |
| from huggingface_hub import snapshot_download |
| # Authenticates with the HF_TOKEN env var; git does not read it. |
| snapshot_download( |
|     repo_id='e-rong/til-26-ae', |
|     repo_type='space', |
|     local_dir='/app/til-26-ae-repo', |
| ) |
| ``` |
| `snapshot_download` auto-uses the `HF_TOKEN` env var. |
|
|
| ### ✅ ALWAYS SMOKE-TEST A JOB BEFORE THE FULL RUN |
|
|
| Submit a 5-minute job that: |
| 1. Downloads the TIL repo |
| 2. Installs deps |
| 3. Runs 100 training steps |
| 4. Saves a dummy checkpoint to the Hub |
|
|
| Only after this succeeds, submit the multi-hour job. |
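
The four steps above can be sketched as a fail-fast runner. The step bodies here are hypothetical stand-ins (empty lambdas); in a real smoke test each would wrap the actual download, install, training, and upload call.

```python
import time


def run_smoke_test(steps, budget_seconds=300):
    """Run (name, callable) steps in order and stop at the first failure.

    Returns (ok, report), where report lists the outcome of each step run.
    """
    report = []
    start = time.monotonic()
    for name, step in steps:
        try:
            step()
        except Exception as exc:  # fail visibly, never silently
            report.append(f"FAIL {name}: {exc!r}")
            return False, report
        report.append(f"ok   {name}")
        if time.monotonic() - start > budget_seconds:
            report.append("FAIL time budget exceeded")
            return False, report
    return True, report


# Hypothetical stand-ins for the four real steps.
steps = [
    ("download TIL repo", lambda: None),
    ("install deps", lambda: None),
    ("train 100 steps", lambda: None),
    ("push dummy checkpoint", lambda: None),
]
ok, report = run_smoke_test(steps)
```

Only submit the multi-hour job when `ok` is true; otherwise `report` shows which step to fix.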
|
|
| --- |
|
|
| ## Session Startup Checklist |
|
|
| Before doing **anything** on this project: |
|
|
| 1. [ ] Read `session_state.json` from `E-Rong/til-26-ae-agent` |
| 2. [ ] Read this file (`AGENTS.md`) |
| 3. [ ] Check latest checkpoint on Hub (sort `phase*_ckpt_*.zip` files) |
| 4. [ ] Determine current phase and remaining steps |
| 5. [ ] If training needed: write script to sandbox, **smoke-test in HF Job first** |
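
Step 3 is easy to get wrong with a plain string sort, because step counts of different lengths do not sort numerically. A small sketch that parses phase and step as integers follows; the file list is partly made up for illustration, and in practice the names would come from something like `HfApi().list_repo_files(...)`.

```python
import re

CKPT_RE = re.compile(r"^phase(\d+)_ckpt_(\d+)\.zip$")


def latest_checkpoint(filenames):
    """Return the checkpoint with the highest (phase, step), or None."""
    parsed = []
    for name in filenames:
        match = CKPT_RE.match(name)
        if match:
            phase, step = int(match.group(1)), int(match.group(2))
            parsed.append((phase, step, name))
    return max(parsed)[2] if parsed else None


files = [
    "phase1_final.zip",
    "phase2_ckpt_550352.zip",  # hypothetical name
    "phase2_ckpt_600352.zip",
    "phase2_ckpt_95000.zip",   # hypothetical name; sorts last as a string
]
print(latest_checkpoint(files))
```

Files such as `phase1_final.zip` do not match the pattern and are ignored, so final models need a separate check.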
|
|
| --- |
|
|
| ## Technical Decisions That Work |
|
|
| ### MaskablePPO + Action Masking |
| - `sb3_contrib.MaskablePPO` with `ActionMasker` |
| - Bomberman has `action_mask: uint8[6]` — walls/edges make moves illegal |
| - Standard PPO wastes ~30-40% of its samples on illegal actions early in training |
| - **Paper**: Huang & Ontañón, "A Closer Look at Invalid Action Masking in Policy Gradient Algorithms" (arXiv:2006.14171) |
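
To make the idea concrete, here is a dependency-free sketch of what masking does to the policy distribution: illegal actions get probability zero and the rest are renormalized. It illustrates the concept only and is not the `sb3_contrib` implementation; the logits and mask are made up.

```python
import math


def masked_probs(logits, action_mask):
    """Softmax over legal actions only; illegal actions get probability 0."""
    if not any(action_mask):
        raise ValueError("at least one action must be legal")
    legal = [logit for logit, ok in zip(logits, action_mask) if ok]
    peak = max(legal)  # subtract the max for numerical stability
    weights = [
        math.exp(logit - peak) if ok else 0.0
        for logit, ok in zip(logits, action_mask)
    ]
    total = sum(weights)
    return [w / total for w in weights]


logits = [0.5, 1.2, -0.3, 0.0, 2.0, 0.1]  # made-up policy outputs
mask = [1, 0, 1, 1, 0, 1]                 # 1 = legal, 0 = blocked
probs = masked_probs(logits, mask)
```

Because blocked actions are never sampled, no rollout steps are spent on moves the environment would reject.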
|
|
| ### Observation Flattening |
| 1511-dim vector from dict observation: |
| ``` |
| agent_viewcone: 7×5×25 = 875 |
| base_viewcone:  5×5×25 = 625 |
| scalars (11):   direction, location[2], base_location[2], health, frozen_ticks, |
|                 base_health, team_resources, team_bombs, step |
| Total: 875 + 625 + 11 = 1511 |
| ``` |
|
|
| ### Wrapper Order (CRITICAL) |
| ```python |
| # CORRECT |
| env = ActionMasker(base_env, lambda e: e.action_masks()) |
| env = Monitor(env) |
| |
| # WRONG — Monitor blocks action_masks() exposure |
| env = ActionMasker(Monitor(base_env), ...) # DON'T DO THIS |
| ``` |
|
|
| ### 3-Phase Curriculum |
| | Phase | Opponent | Duration (steps) | Purpose | |
| |---|---|---|---| |
| | 1 | Random | 500k | Learn basics | |
| | 2 | Random + visit-count shaping | 500k | Prevent camping | |
| | 3 | Rule-based curriculum | 1M | Generalize to structured opponents | |
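
This file does not record the exact shaping term used in Phase 2. As an illustration only, one common form of visit-count shaping pays a bonus that shrinks each time the agent revisits a cell; the `beta` scale and the `1/sqrt(n)` decay below are assumptions, not the project's settings.

```python
import math
from collections import Counter


class VisitCountBonus:
    """Count-based bonus: rarely visited cells pay more than camped ones."""

    def __init__(self, beta=0.1):
        self.beta = beta         # assumed bonus scale
        self.counts = Counter()  # visits per (x, y) cell this episode

    def reset(self):
        """Clear counts at the start of each episode."""
        self.counts.clear()

    def __call__(self, location):
        cell = tuple(location)
        self.counts[cell] += 1
        return self.beta / math.sqrt(self.counts[cell])


bonus = VisitCountBonus(beta=0.1)
first = bonus((3, 4))   # first visit pays the full bonus
second = bonus((3, 4))  # repeat visits pay less
fresh = bonus((3, 5))   # a new cell pays the full bonus again
```

The bonus would be added to the environment reward inside a wrapper, with `reset()` called on every episode reset.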
|
|
| ### Checkpointing Every 50k Steps |
| - Each checkpoint is saved locally and pushed to the Hub via `HfApi.upload_file()` |
| - This saved the project when sandboxes reset at 400k and 600k steps |
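
A sketch of the save-and-push logic, assuming the main repo is a model repo and that `HF_TOKEN` is set when the upload runs. The helper names are hypothetical; `HfApi.upload_file` is the only library call.

```python
import os

CHECKPOINT_EVERY = 50_000
REPO_ID = "E-Rong/til-26-ae-agent"


def checkpoint_due(num_timesteps, last_saved, every=CHECKPOINT_EVERY):
    """True once at least `every` steps have passed since the last save."""
    return num_timesteps - last_saved >= every


def push_checkpoint(local_path, repo_id=REPO_ID):
    """Upload one checkpoint file to the Hub under its own file name."""
    from huggingface_hub import HfApi  # lazy import; needs HF_TOKEN at call time

    HfApi().upload_file(
        path_or_fileobj=local_path,
        path_in_repo=os.path.basename(local_path),
        repo_id=repo_id,
        repo_type="model",  # assumption: the main repo is a model repo
    )


# Inside a training loop or callback (sketch):
#   if checkpoint_due(step, last_saved):
#       model.save(path); push_checkpoint(path); last_saved = step
```

The check uses `>=` on the step difference rather than `step % every == 0`, since rollout-based training advances the step counter in chunks and can skip exact multiples.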
|
|
| --- |
|
|
| ## Technical Decisions That Failed |
|
|
| | Decision | Why It Failed | Fix | |
| |---|---|---| |
| | Training in sandboxes | Process died, empty sandbox kept billing | Use HF Jobs | |
| | `git clone` in HF Job | No auth for private repo | `snapshot_download` | |
| | Inline 20KB script in `hf_jobs.script` | Delivery mechanism choked | Write to sandbox file first, submit path | |
| | No session state on Hub | Lost track of progress across resets | `session_state.json` + this file | |
| | `Monitor` inside `ActionMasker` | `get_action_masks()` failed | `ActionMasker` → `Monitor` order | |
|
|
| --- |
|
|
| ## Cost Awareness |
|
|
| | Hardware | $/hr | Good For | |
| |---|---|---| |
| | `cpu-basic` | ~$0.05 | Writing scripts, reading files, small tests | |
| | `t4-small` | ~$0.40 | Short dev, NOT training | |
| | `a10g-small` | ~$1.00 | Training, but use HF Jobs not sandboxes | |
| | `a10g-large` | ~$2.00 | Larger batch sizes, not needed for this project | |
|
|
| **Rule**: If a task takes >30 min, it must be an HF Job. Sandboxes are for editing and quick tests only. |
|
|
| --- |
|
|
| ## Sandbox Policy (User Mandate) |
|
|
| > **From this point forward, the user has mandated:** |
|
|
| 1. **Start `cpu-basic` sandbox** at the beginning of every session |
| 2. **Use `cpu-basic` for**: context, writing code, writing docs, editing files, planning |
| 3. **Only switch to GPU sandbox** (`t4-small` or `a10g-small`) when performing **smoke tests** for training scripts |
| 4. **Stop GPU sandbox IMMEDIATELY** after the smoke test completes |
| 5. **Training tasks ONLY as HF Jobs** — never leave a training process running in a sandbox |
| 6. **Never leave a GPU sandbox running idle** — this wastes money |
|
|
| **Why this matters**: A GPU sandbox at $1/hr running empty for 3 hours costs $3 and produces nothing. An HF Job at the same $1/hr actually trains for every billed minute. |
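
The same arithmetic as a tiny estimator, using the approximate rates from the Cost Awareness table. The rates are rough figures from this file, not authoritative pricing.

```python
# Approximate hourly rates copied from the Cost Awareness table.
RATES_PER_HOUR = {
    "cpu-basic": 0.05,
    "t4-small": 0.40,
    "a10g-small": 1.00,
    "a10g-large": 2.00,
}


def cost(hardware, hours):
    """Estimated spend in dollars for `hours` on the given hardware."""
    return RATES_PER_HOUR[hardware] * hours


idle = cost("a10g-small", 3)        # empty GPU sandbox for 3 hours
smoke = cost("a10g-small", 5 / 60)  # a 5-minute smoke test
print(f"idle: ${idle:.2f}, smoke test: ${smoke:.2f}")
```

A 5-minute smoke test costs cents; an idle GPU sandbox left for an afternoon costs dollars.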
|
|
| --- |
|
|
| ## Repo File Guide |
|
|
| | File | What It Is | |
| |---|---| |
| | `session_state.json` | Current phase, checkpoint, mistakes log, next steps | |
| | `AGENTS.md` | This file — lessons and context | |
| | `docs/ae.md` | Full project documentation (research, design, results) | |
| | `phase1_final.zip` | Complete Phase 1 model | |
| | `phase2_ckpt_*.zip` | Phase 2 intermediate checkpoints | |
| | `phase2_final.zip` | Phase 2 complete model (when done) | |
| | `ae_manager.py` | Inference code for the evaluation server | |
| | `phase2_job.py` | Latest HF Job script (may need fixes) | |
| | `train_all_phases.py` | Original training script | |
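
A sketch of loading and validating `session_state.json`. The key names are illustrative assumptions derived from the table's description (phase, checkpoint, mistakes log, next steps); the real file may use different ones.

```python
import json

# Assumed schema: key names are illustrative, not read from the real file.
REQUIRED_KEYS = {"phase", "latest_checkpoint", "mistakes", "next_steps"}


def load_session_state(text):
    """Parse session_state.json content and check the expected keys exist."""
    state = json.loads(text)
    missing = REQUIRED_KEYS - state.keys()
    if missing:
        raise KeyError(f"session state is missing: {sorted(missing)}")
    return state


example = json.dumps({
    "phase": 2,
    "latest_checkpoint": "phase2_ckpt_600352.zip",
    "mistakes": ["trained in a sandbox", "git clone in an HF Job"],
    "next_steps": ["smoke-test the resume job"],
})
state = load_session_state(example)
```

Failing loudly on a missing key is deliberate: a half-written state file should stop the session before any compute is committed.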
|
|
| --- |
|
|
| ## Current State (Last Updated: 2026-05-14) |
|
|
| - **Phase**: 2 (in progress; the last run crashed) |
| - **Latest checkpoint**: `phase2_ckpt_600352.zip` |
| - **Target**: 1,000,352 total steps (400,000 remaining) |
| - **Blocker**: Need to resume via HF Job with `snapshot_download` auth |
| - **Smoke test needed**: YES — before any multi-hour job |
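
The remaining-steps figure above can be recomputed from the checkpoint name instead of being trusted from memory. A small sketch, assuming the number in the file name is the cumulative step count:

```python
import re


def checkpoint_step(filename):
    """Extract the step count from a name like phase2_ckpt_600352.zip."""
    match = re.search(r"_ckpt_(\d+)\.zip$", filename)
    if match is None:
        raise ValueError(f"not a checkpoint name: {filename}")
    return int(match.group(1))


TARGET_STEPS = 1_000_352
done = checkpoint_step("phase2_ckpt_600352.zip")
remaining = TARGET_STEPS - done
print(remaining)  # 400000
```

When resuming with Stable-Baselines3, `remaining` would typically be passed as `total_timesteps` together with `reset_num_timesteps=False`, so the step counter continues from the checkpoint; verify this against the installed version before the full run.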
|
|
| --- |
|
|
| ## Contact / Ownership |
|
|
| - **User**: E-Rong |
| - **Org**: E-Rong |
| - **Billing**: Must specify `namespace="E-Rong"` on all HF Jobs |
| - **I am**: An ephemeral agent with no persistent memory. I depend on this file. |
|
|
| --- |
|
|
| *If you are a future version of me reading this: you have no memory of the above. Do not trust your internal knowledge. Read the files, check the state, test before committing compute.* |
|
|