E-Rong committed · Commit 4b6177e · verified · 1 Parent(s): 90f17a9

Add HF Jobs research — snapshot_download for private repos, script submission pattern

Files changed (1)
  1. AGENTS.md +52 -0
AGENTS.md CHANGED
@@ -154,6 +154,57 @@ env = ActionMasker(Monitor(base_env), ...) # DON'T DO THIS
  ---

+ ## How to Submit HF Jobs Correctly (Research Results)
+
+ ### Based on `huggingface.co/docs/hub/jobs-quickstart`:
+
+ **DO NOT use `git clone` for private repos.**
+
+ ```python
+ # WRONG ❌
+ import subprocess
+ subprocess.run(["git", "clone", "https://huggingface.co/spaces/e-rong/til-26-ae"])
+ # Fails: git does not read HF_TOKEN env var
+
+ # CORRECT ✅
+ from huggingface_hub import snapshot_download
+ snapshot_download(
+     repo_id="e-rong/til-26-ae",
+     repo_type="space",
+     local_dir="/app/til-26-ae-repo",
+ )
+ # snapshot_download auto-uses HF_TOKEN from environment
+ ```
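
The download above depends on `HF_TOKEN` being present in the job environment. The following is a minimal sketch, assuming you would rather fail fast with a clear message than hit an authentication error partway through the download. `require_hf_token` and `download_space` are hypothetical helper names introduced here for illustration; `token=` is an existing keyword argument of `snapshot_download`.

```python
import os


def require_hf_token(env=None):
    """Return HF_TOKEN from the environment, or raise with a clear message."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; the private repo cannot be downloaded")
    return token


def download_space(repo_id, local_dir, token):
    """Download a Space repo with an explicit token (hypothetical wrapper)."""
    # Imported lazily so the token check above runs even if the
    # huggingface_hub package is not importable yet.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        repo_type="space",
        local_dir=local_dir,
        token=token,
    )
```

A job script could then call `download_space("e-rong/til-26-ae", "/app/til-26-ae-repo", require_hf_token())` as its first step, so a missing token surfaces before any GPU time is spent.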
+
+ ### Script Submission Pattern
+
+ ```python
+ # Step 1: Write script to sandbox file first
+ write(path="/app/train.py", content="...")
+
+ # Step 2: Submit as file path (not inline)
+ hf_jobs(
+     operation="run",
+     script="/app/train.py",  # ← sandbox file path, gets uploaded
+     dependencies=["torch", "sb3-contrib", "gymnasium", "pettingzoo",
+                   "numpy", "huggingface_hub", "pygame", "omegaconf",
+                   "mazelib", "imageio", "imageio-ffmpeg", "supersuit", "psutil"],
+     hardware_flavor="a10g-small",
+     timeout="6h",
+     namespace="E-Rong",  # ← bills to org
+ )
+ ```
+
+ The `script` parameter is a **sandbox file path** that gets uploaded to the job container. `dependencies` maps to `--with` in the `uv run` CLI.
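
To make the `dependencies` to `--with` mapping concrete, here is a small sketch that builds the corresponding `uv run` argument list. It only illustrates the mapping: the exact command the job runner constructs is an assumption, and `uv_run_command` is a hypothetical helper, not part of any HF tool.

```python
import shlex


def uv_run_command(script, dependencies):
    """Build a `uv run` argument list with one `--with` flag per dependency."""
    args = ["uv", "run"]
    for package in dependencies:
        args += ["--with", package]
    args.append(script)
    return args


cmd = uv_run_command("train.py", ["torch", "sb3-contrib", "gymnasium"])
print(shlex.join(cmd))
# prints: uv run --with torch --with sb3-contrib --with gymnasium train.py
```

Running the printed command locally is one way to check that the dependency list resolves before paying for job hardware.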
+
+ ### Job Persistence
+ - Jobs run on HF infrastructure, not in your sandbox
+ - The sandbox can die; the job keeps running
+ - Check logs with `hf_jobs(operation="logs", job_id="...")`
+ - Job storage is ephemeral: **push checkpoints to Hub** (not just local)
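
Because job storage is ephemeral, a training script needs to upload each checkpoint as it is written. The sketch below shows one way to do that with `HfApi.upload_file` from `huggingface_hub`. The helper name `push_checkpoint`, the `checkpoints/` folder layout, and the destination `repo_id` are all assumptions for illustration; the source does not say which repo checkpoints should go to.

```python
from pathlib import Path


def push_checkpoint(local_path, repo_id, api=None, repo_type="model"):
    """Upload one checkpoint file to the Hub and return its path in the repo."""
    if api is None:
        # Imported lazily; HfApi reads HF_TOKEN from the environment.
        from huggingface_hub import HfApi

        api = HfApi()
    # Hypothetical layout: keep every checkpoint under checkpoints/.
    path_in_repo = f"checkpoints/{Path(local_path).name}"
    api.upload_file(
        path_or_fileobj=str(local_path),
        path_in_repo=path_in_repo,
        repo_id=repo_id,
        repo_type=repo_type,
    )
    return path_in_repo
```

Calling this right after each save (for example once `phase2_final.zip` is written) means a timeout or crash loses at most the work since the last upload. The `api` parameter exists so the function can be exercised with a stand-in object without network access.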
+
+ ---
+
  ## Repo File Guide

  | File | What It Is |
@@ -166,6 +217,7 @@ env = ActionMasker(Monitor(base_env), ...) # DON'T DO THIS
  | `phase2_final.zip` | Phase 2 complete model (when done) |
  | `ae_manager.py` | Inference code for the evaluation server |
  | `phase2_job.py` | Latest HF Job script (may need fixes) |
+ | `smoke_test.py` | 5-minute validation job; test before any real job |
  | `train_all_phases.py` | Original training script |

  ---