Add HF Jobs research — snapshot_download for private repos, script submission pattern
AGENTS.md
CHANGED
@@ -154,6 +154,57 @@ env = ActionMasker(Monitor(base_env), ...) # DON'T DO THIS
---

## How to Submit HF Jobs Correctly (Research Results)

### Based on `huggingface.co/docs/hub/jobs-quickstart`:

**DO NOT use `git clone` for private repos.**

```python
# WRONG ❌
import subprocess
subprocess.run(["git", "clone", "https://huggingface.co/spaces/e-rong/til-26-ae"])
# Fails: git does not read HF_TOKEN env var

# CORRECT ✅
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id="e-rong/til-26-ae",
    repo_type="space",
    local_dir="/app/til-26-ae-repo"
)
# snapshot_download auto-uses HF_TOKEN from environment
```
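
Because the download depends on `HF_TOKEN` being present in the job environment, it can help to check for the token before calling the library, so that a missing secret produces a clear error rather than an authentication failure partway through the job. The sketch below is one possible wrapper, not part of the original script: the helper name `fetch_private_space` is hypothetical, and passing `token=` explicitly is a choice made here for visibility (the library would also pick the variable up on its own).

```python
import os


def fetch_private_space(repo_id, local_dir, token=None):
    """Download a Space snapshot, failing early if no token is available."""
    # Fall back to the HF_TOKEN environment variable when no token is passed.
    token = token or os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; a private repo cannot be downloaded without it"
        )
    # Imported lazily so the token check runs before the library is loaded.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=repo_id,
        repo_type="space",
        local_dir=local_dir,
        token=token,  # explicit here, instead of relying on auto-detection
    )
```

Inside the job script this would be called as `fetch_private_space("e-rong/til-26-ae", "/app/til-26-ae-repo")`, which returns the local path of the downloaded snapshot.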

### Script Submission Pattern

```python
# Step 1: Write script to sandbox file first
write(path="/app/train.py", content="...")

# Step 2: Submit as file path (not inline)
hf_jobs(
    operation="run",
    script="/app/train.py",  # ← sandbox file path, gets uploaded
    dependencies=["torch", "sb3-contrib", "gymnasium", "pettingzoo",
                  "numpy", "huggingface_hub", "pygame", "omegaconf",
                  "mazelib", "imageio", "imageio-ffmpeg", "supersuit", "psutil"],
    hardware_flavor="a10g-small",
    timeout="6h",
    namespace="E-Rong"  # ← bills to org
)
```

The `script` parameter is a **sandbox file path** that gets uploaded to the job container. `dependencies` maps to `--with` in the `uv run` CLI.
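
To make the `dependencies` to `--with` mapping concrete, here is a minimal sketch that builds the corresponding `uv run` argument list, with one `--with` flag per dependency. The helper name `uv_run_args` is hypothetical and not part of any HF tooling, and the exact command the job runner executes may differ; this only illustrates the shape of the mapping.

```python
def uv_run_args(script, dependencies):
    """Build a `uv run` argument list with one --with flag per dependency."""
    args = ["uv", "run"]
    for dep in dependencies:
        # Each dependency becomes its own `--with <package>` pair.
        args += ["--with", dep]
    # The script path comes last, after all options.
    args.append(script)
    return args


print(" ".join(uv_run_args("/app/train.py", ["torch", "sb3-contrib"])))
# prints: uv run --with torch --with sb3-contrib /app/train.py
```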

### Job Persistence

- Jobs run on HF infrastructure, not in your sandbox
- The sandbox can die; the job keeps running
- Check logs with `hf_jobs(operation="logs", job_id="...")`
- Job storage is ephemeral: **push checkpoints to the Hub** (not just local disk)
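
Since job storage disappears when the job ends, each checkpoint needs to be uploaded from inside the training script. The sketch below shows one way to do that with `HfApi.upload_file`. The helper name `push_checkpoint`, the `checkpoints/` prefix, and the target repo are all assumptions made for illustration; the target repo must already exist and `HF_TOKEN` must have write access to it. The `api` argument is there so the function can be exercised with a stand-in object, and it would typically be called right after each local save, for example `push_checkpoint("/app/phase2_final.zip", repo_id)` with a repo id of your choosing.

```python
import os


def push_checkpoint(local_path, repo_id, api=None, repo_type="model"):
    """Upload one checkpoint file to the Hub under checkpoints/<filename>."""
    if api is None:
        # Imported lazily; HfApi picks up HF_TOKEN from the environment.
        from huggingface_hub import HfApi

        api = HfApi()
    # Hypothetical layout: keep every checkpoint under a single folder.
    path_in_repo = "checkpoints/" + os.path.basename(local_path)
    api.upload_file(
        path_or_fileobj=local_path,
        path_in_repo=path_in_repo,
        repo_id=repo_id,
        repo_type=repo_type,
    )
    return path_in_repo
```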

---

## Repo File Guide

| File | What It Is |

@@ -166,6 +217,7 @@ env = ActionMasker(Monitor(base_env), ...) # DON'T DO THIS

| `phase2_final.zip` | Phase 2 complete model (when done) |
| `ae_manager.py` | Inference code for the evaluation server |
| `phase2_job.py` | Latest HF Job script (may need fixes) |
| `smoke_test.py` | 5-minute validation job; run it before any real job |
| `train_all_phases.py` | Original training script |

---