E-Rong/til-26-ae-agent · ml-intern


E-Rong committed about 22 hours ago

Commit 3745a2d (verified) · 1 parent: 7d18d2b

Update AGENTS.md: document how hf_jobs script parameter actually works (converts to raw Hub URL)

Files changed (1): AGENTS.md (+35 -7)

@@ -176,26 +176,54 @@ snapshot_download(
 # snapshot_download auto-uses HF_TOKEN from environment
 ```
-### Script Submission Pattern
 ```python
-# Step 1: Write script to sandbox file first
 write(path="/app/train.py", content="...")
-# Step 2: Submit as file path (not inline)
 hf_jobs(
     operation="run",
-    script="/app/train.py",  # ← sandbox file path, gets uploaded
     dependencies=["torch", "sb3-contrib", "gymnasium", "pettingzoo",
                   "numpy", "huggingface_hub", "pygame", "omegaconf",
                   "mazelib", "imageio", "imageio-ffmpeg", "supersuit", "psutil"],
     hardware_flavor="a10g-small",
     timeout="6h",
-    namespace="E-Rong"  # ← bills to org
 )
 ```
-The `script` parameter is a **sandbox file path** that gets uploaded to the job container. `dependencies` maps to `--with` in the `uv run` CLI.
 ### Job Persistence
 - Jobs run on HF infrastructure, not in your sandbox
@@ -216,7 +244,7 @@ The `script` parameter is a **sandbox file path** that gets uploaded to the job
 | `phase2_ckpt_*.zip` | Phase 2 intermediate checkpoints |
 | `phase2_final.zip` | Phase 2 complete model (when done) |
 | `ae_manager.py` | Inference code for the evaluation server |
-| `phase2_job.py` | Latest HF Job script (may need fixes) |
 | `smoke_test.py` | 5-minute validation job — test before any real job |
 | `train_all_phases.py` | Original training script |

 # snapshot_download auto-uses HF_TOKEN from environment
 ```
+### Script Submission Pattern (What Actually Works)
+**⚠️ CRITICAL DISCOVERY: The `script` parameter in `hf_jobs` becomes a RAW HUB URL.**
+When you call `hf_jobs(script="/app/train.py")`, the job system does NOT upload the local file. Instead, it converts the path to:
+```
+https://huggingface.co/E-Rong/til-26-ae-agent/raw/main/train.py
+```
+and runs it via `uv run <url>`. **This means the file MUST already exist on the Hub repo.**
+**The correct workflow is:**
 ```python
+from tools import write, hf_repo_files, hf_jobs
+# Step 1: Write script to sandbox file
 write(path="/app/train.py", content="...")
+# Step 2: ALSO upload to Hub repo so it's persisted and URL-accessible
+hf_repo_files(
+    operation="upload",
+    repo_id="E-Rong/til-26-ae-agent",
+    path="train.py",
+    content=open("/app/train.py").read()
+)
+# Step 3: Submit job referencing the sandbox path
+# The job system will convert this to a Hub raw URL under the hood
 hf_jobs(
     operation="run",
+    script="/app/train.py",           # ← sandbox file path
     dependencies=["torch", "sb3-contrib", "gymnasium", "pettingzoo",
                   "numpy", "huggingface_hub", "pygame", "omegaconf",
                   "mazelib", "imageio", "imageio-ffmpeg", "supersuit", "psutil"],
     hardware_flavor="a10g-small",
     timeout="6h",
+    namespace="E-Rong"              # ← bills to org
 )
 ```
+**Verification from `hf_jobs inspect`:**
+```bash
+exec uv run --with torch --with sb3-contrib ... \
+    https://huggingface.co/E-Rong/til-26-ae-agent/raw/main/phase2_resume.py
+```
+The job fetches the script from the Hub, not from the sandbox. The sandbox path is just used to derive the repo/file path.
+**Why this matters**: If you only write to `/app/train.py` and don't upload to the Hub, the job will fail with a 404 when it tries to fetch the URL. The sandbox resets, but the Hub URL is permanent.
 ### Job Persistence
 - Jobs run on HF infrastructure, not in your sandbox
 | `phase2_ckpt_*.zip` | Phase 2 intermediate checkpoints |
 | `phase2_final.zip` | Phase 2 complete model (when done) |
 | `ae_manager.py` | Inference code for the evaluation server |
+| `phase2_resume.py` | Latest HF Job script (works — uses snapshot_download) |
 | `smoke_test.py` | 5-minute validation job — test before any real job |
 | `train_all_phases.py` | Original training script |
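
The path-to-URL conversion described in the AGENTS.md change above can be sketched as a small pre-flight check. This is only an illustration under stated assumptions, not the job system's actual implementation: `derive_raw_hub_url` and `ensure_script_on_hub` are hypothetical helper names, the URL template is copied from the example in the diff, and the rule that only the file name of the sandbox path is kept is inferred from `/app/train.py` mapping to `.../raw/main/train.py`. How nested paths are handled is not documented in the source.

```python
from pathlib import PurePosixPath
from typing import Callable

# Assumed URL shape, copied from the example in the AGENTS.md change.
RAW_URL_TEMPLATE = "https://huggingface.co/{repo_id}/raw/{revision}/{filename}"


def derive_raw_hub_url(script_path: str, repo_id: str, revision: str = "main") -> str:
    """Hypothetical helper: guess the raw Hub URL a job would fetch.

    Assumption: only the file name of the sandbox path is kept
    (/app/train.py -> train.py), as in the documented example.
    """
    filename = PurePosixPath(script_path).name
    if not filename:
        raise ValueError(f"no file name in script path: {script_path!r}")
    return RAW_URL_TEMPLATE.format(
        repo_id=repo_id, revision=revision, filename=filename
    )


def ensure_script_on_hub(
    script_path: str,
    repo_id: str,
    exists: Callable[[str, str], bool],
) -> str:
    """Return the expected URL, or raise before a job is submitted.

    `exists(repo_id, filename)` is injected so the check can run offline;
    in practice it could wrap a Hub lookup of your choice.
    """
    url = derive_raw_hub_url(script_path, repo_id)
    filename = PurePosixPath(script_path).name
    if not exists(repo_id, filename):
        raise FileNotFoundError(
            f"{filename} is not in {repo_id}; upload it first, "
            f"otherwise the job would hit a 404 at {url}"
        )
    return url
```

Calling `derive_raw_hub_url("/app/train.py", "E-Rong/til-26-ae-agent")` returns `https://huggingface.co/E-Rong/til-26-ae-agent/raw/main/train.py`, which matches the URL quoted in the diff. Running `ensure_script_on_hub` before `hf_jobs(...)` turns a remote 404 into a local error raised before any job is submitted.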