E-Rong committed on
Commit 3745a2d · verified · 1 Parent(s): 7d18d2b

Update AGENTS.md: document how hf_jobs script parameter actually works (converts to raw Hub URL)

  1. AGENTS.md +35 -7
AGENTS.md CHANGED
@@ -176,26 +176,54 @@ snapshot_download(
  # snapshot_download auto-uses HF_TOKEN from environment
  ```

- ### Script Submission Pattern

  ```python
- # Step 1: Write script to sandbox file first
  write(path="/app/train.py", content="...")

- # Step 2: Submit as file path (not inline)
  hf_jobs(
      operation="run",
-     script="/app/train.py", # ← sandbox file path, gets uploaded
      dependencies=["torch", "sb3-contrib", "gymnasium", "pettingzoo",
                    "numpy", "huggingface_hub", "pygame", "omegaconf",
                    "mazelib", "imageio", "imageio-ffmpeg", "supersuit", "psutil"],
      hardware_flavor="a10g-small",
      timeout="6h",
-     namespace="E-Rong" # ← bills to org
  )
  ```

- The `script` parameter is a **sandbox file path** that gets uploaded to the job container. `dependencies` maps to `--with` in the `uv run` CLI.

  ### Job Persistence
  - Jobs run on HF infrastructure, not in your sandbox
@@ -216,7 +244,7 @@ The `script` parameter is a **sandbox file path** that gets uploaded to the job
  | `phase2_ckpt_*.zip` | Phase 2 intermediate checkpoints |
  | `phase2_final.zip` | Phase 2 complete model (when done) |
  | `ae_manager.py` | Inference code for the evaluation server |
- | `phase2_job.py` | Latest HF Job script (may need fixes) |
  | `smoke_test.py` | 5-minute validation job — test before any real job |
  | `train_all_phases.py` | Original training script |
 
  # snapshot_download auto-uses HF_TOKEN from environment
  ```

+ ### Script Submission Pattern (What Actually Works)
+
+ **⚠️ CRITICAL DISCOVERY: The `script` parameter in `hf_jobs` becomes a RAW HUB URL.**
+
+ When you call `hf_jobs(script="/app/train.py")`, the job system does NOT upload the local file. Instead, it converts the path to:
+ ```
+ https://huggingface.co/E-Rong/til-26-ae-agent/raw/main/train.py
+ ```
+ and runs it via `uv run <url>`. **This means the file MUST already exist on the Hub repo.**
+
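The path-to-URL conversion described above can be sketched in a few lines. This is an illustrative reconstruction, not the job system's actual code: the helper name `raw_hub_url` is hypothetical, and it assumes that only the file name of the sandbox path is kept and that the revision is `main`, as in the URL shown.

```python
from pathlib import PurePosixPath

HUB_BASE = "https://huggingface.co"


def raw_hub_url(sandbox_path: str, repo_id: str, revision: str = "main") -> str:
    """Build the raw Hub URL that a sandbox script path appears to map to.

    Assumption: the directory part of the sandbox path (e.g. /app) is
    dropped and only the file name is looked up at the repo root.
    """
    filename = PurePosixPath(sandbox_path).name
    return f"{HUB_BASE}/{repo_id}/raw/{revision}/{filename}"


print(raw_hub_url("/app/train.py", "E-Rong/til-26-ae-agent"))
# → https://huggingface.co/E-Rong/til-26-ae-agent/raw/main/train.py
```

If the real mapping keeps subdirectories or uses another revision, the helper would need adjusting; the point is only that the sandbox directory never reaches the job.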
+ **The correct workflow is:**

  ```python
+ from tools import write, hf_repo_files, hf_jobs
+
+ # Step 1: Write script to sandbox file
  write(path="/app/train.py", content="...")

+ # Step 2: ALSO upload to Hub repo so it's persisted and URL-accessible
+ hf_repo_files(
+     operation="upload",
+     repo_id="E-Rong/til-26-ae-agent",
+     path="train.py",
+     content=open("/app/train.py").read()
+ )
+
+ # Step 3: Submit job referencing the sandbox path
+ # The job system will convert this to a Hub raw URL under the hood
  hf_jobs(
      operation="run",
+     script="/app/train.py", # ← sandbox file path
      dependencies=["torch", "sb3-contrib", "gymnasium", "pettingzoo",
                    "numpy", "huggingface_hub", "pygame", "omegaconf",
                    "mazelib", "imageio", "imageio-ffmpeg", "supersuit", "psutil"],
      hardware_flavor="a10g-small",
      timeout="6h",
+     namespace="E-Rong" # ← bills to org
  )
  ```

+ **Verification from `hf_jobs inspect`:**
+ ```bash
+ exec uv run --with torch --with sb3-contrib ... \
+     https://huggingface.co/E-Rong/til-26-ae-agent/raw/main/phase2_resume.py
+ ```
+ The job fetches the script from the Hub, not from the sandbox. The sandbox path is just used to derive the repo/file path.
+
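To repeat this verification on another job, one option is to pull the script URL out of the inspected command string. The sketch below is only an assumption about how that could be done: `script_url_from_command` is a hypothetical helper, and the sample command is a shortened, single-line stand-in typed by hand rather than real `hf_jobs inspect` output.

```python
import shlex
from typing import Optional
from urllib.parse import urlparse


def script_url_from_command(command: str) -> Optional[str]:
    """Return the last http(s) URL token in a `uv run` command, if any."""
    for token in reversed(shlex.split(command)):
        parsed = urlparse(token)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            return token
    return None


# Shortened sample command (assumption: the real one lists more --with flags).
sample = (
    "exec uv run --with torch --with sb3-contrib "
    "https://huggingface.co/E-Rong/til-26-ae-agent/raw/main/phase2_resume.py"
)
url = script_url_from_command(sample)
print(url)
print(urlparse(url).netloc)  # a Hub host here means the script came from the repo
```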
+ **Why this matters**: If you only write to `/app/train.py` and don't upload to the Hub, the job will fail with a 404 when it tries to fetch the URL. The sandbox resets, but the Hub URL is permanent.
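A pre-flight check can catch the 404 case before a job is submitted. The following is a minimal sketch under stated assumptions: `ensure_script_on_hub` is a hypothetical helper, it assumes the script lives at the root of a model repo, and by default it relies on `file_exists` from a reasonably recent `huggingface_hub`. The `exists` parameter is there so the check can be exercised without network access.

```python
from pathlib import PurePosixPath
from typing import Callable, Optional


def ensure_script_on_hub(
    sandbox_path: str,
    repo_id: str,
    exists: Optional[Callable[[str, str], bool]] = None,
) -> str:
    """Raise if the script's file name is missing from the Hub repo.

    Returns the file name that the job is expected to fetch.
    """
    filename = PurePosixPath(sandbox_path).name
    if exists is None:
        # Imported lazily so the helper can be tested with a stub checker.
        from huggingface_hub import file_exists
        exists = file_exists
    if not exists(repo_id, filename):
        raise FileNotFoundError(
            f"{filename} is not in {repo_id}; upload it before calling hf_jobs"
        )
    return filename
```

Calling `ensure_script_on_hub("/app/train.py", "E-Rong/til-26-ae-agent")` right before `hf_jobs(...)` would turn a failed job into an immediate local error.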

  ### Job Persistence
  - Jobs run on HF infrastructure, not in your sandbox
 
  | `phase2_ckpt_*.zip` | Phase 2 intermediate checkpoints |
  | `phase2_final.zip` | Phase 2 complete model (when done) |
  | `ae_manager.py` | Inference code for the evaluation server |
+ | `phase2_resume.py` | Latest HF Job script (works; uses `snapshot_download`) |
  | `smoke_test.py` | 5-minute validation job — test before any real job |
  | `train_all_phases.py` | Original training script |