ashirato committed · Commit a5c37dd · 1 Parent(s): 6b77b52

start.sh: gate 5 heavy harvest launchers behind LOW_MEM


CPU-Basic Space cap is 16 GB. With uvicorn + 5 nohup'd bash launchers
(dedup-bootstrap, dataset-enrich, kaggle-trainer, lightning-trainer,
dataset-mirror) running concurrently while harvest workers hammered
/cursor/* at ~100 req/min, the container blew through 16 GB roughly
5 min after every boot and HF auto-killed it. The watchdog correctly
flagged 'HTTP layer hung' but the actual signal was OOM_KILL.
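For next time: an OOM kill is distinguishable from a true hang by polling
the cgroup memory controller from inside the container. A minimal sketch,
assuming a cgroup-v2 runtime (the mount path and the 30 s interval are
assumptions, not part of this commit):

  # Log memory pressure so the watchdog can tell OOM from "HTTP layer hung".
  # Assumes cgroup v2 mounted at /sys/fs/cgroup; other runtimes may differ.
  CG=/sys/fs/cgroup
  while sleep 30; do
    cur=$(cat "$CG/memory.current" 2>/dev/null || echo 0)
    oom=$(awk '/^oom_kill /{print $2}' "$CG/memory.events" 2>/dev/null)
    echo "[$(date +%H:%M:%S)] mem=$((cur / 1024 / 1024))MiB oom_kill=${oom:-0}" >> "$LOG_DIR/boot.log"
  done &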

Fix: gate the 5 launchers behind LOW_MEM=0. LOW_MEM defaults to 1 (set
in the global env at line ~20 BEFORE any reference, so 'set -u' is
happy) so on CPU-Basic the launchers no-op and the Space stays
under ~2 GB resident: just FastAPI + Redis + status server.
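The ordering matters because 'set -u' aborts on the first read of an unset
variable; defaulting with ${LOW_MEM:-1} before any reference is the standard
guard. A minimal repro of the failure mode this avoids:

  #!/usr/bin/env bash
  set -u
  # Reading LOW_MEM before it is defined would abort the script with:
  #   LOW_MEM: unbound variable
  LOW_MEM="${LOW_MEM:-1}"   # safe: env value if set, else default 1
  if [[ "$LOW_MEM" != "1" ]]; then
    echo "high-mem mode: launching harvest workers"
  fi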

Harvest pipeline: the 3 recurring launchers (dataset-enrich every 30m,
dataset-mirror every 6h, lightning-trainer daily 02:00 UTC) move to
GCP scheduler via hermes-jobs.json entries. dedup-bootstrap is
one-shot, so it doesn't need a cron entry. kaggle-trainer is launched manually
from Mac for V19 runs.
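For reference, the three recurring entries could look like the sketch below;
the field names ("name", "cmd", "cron") are assumptions, since
hermes-jobs.json itself isn't part of this commit, so match them to the
scheduler daemon's real schema. Hypothetical shape (cron expressions in UTC):

  [
    { "name": "dataset-enrich",    "cmd": "bin/dataset-enrich.sh",    "cron": "*/30 * * * *" },
    { "name": "dataset-mirror",    "cmd": "bin/dataset-mirror.sh",    "cron": "0 */6 * * *"  },
    { "name": "lightning-trainer", "cmd": "bin/lightning-trainer.sh", "cron": "0 2 * * *"    }
  ]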

Re-enable in-Space launchers by setting LOW_MEM=0 (Space variable)
once we upgrade to a paid tier (cpu-upgrade ≥ 32 GB) or migrate to a
larger anchor like OCI A1.Flex 24 GB ARM.

Files changed (1):
  start.sh  +38 -27
start.sh CHANGED

@@ -15,6 +15,13 @@ set -x
 # Echo stdout so HF run-logs see progress (safe steps before .env is loaded)
 exec > >(tee -a "$LOG_DIR/boot.log") 2>&1
 
+# ── Memory mode (must be set BEFORE any reference; we use `set -u`) ───────
+# CPU-Basic Space = 16 GB cap. With LOW_MEM=1 we skip the heavy harvest
+# launchers (dataset-enrich, dataset-mirror, kaggle-trainer, lightning-trainer,
+# dedup-bootstrap) — those run on GCP daemons instead. Set LOW_MEM=0 only
+# on a paid Space tier (cpu-upgrade ≥32 GB).
+LOW_MEM="${LOW_MEM:-1}"
+
 # ── 1. Persistent data — symlink state subdirs to /data (HF persistent mount) ──
 # bin/ is NOT persisted (baked into image, refreshed on every push).
 # Persisted: state (DBs), logs, memory, skills, sessions, training pairs,
@@ -86,34 +93,38 @@ if [[ -d "$DATA" ]] && [[ -w "$DATA" ]]; then
     fi
   fi
 
-  # ── One-time central dedup bootstrap from existing data ──────────────
-  if [[ ! -f "${HOME}/.surrogate/.dedup-bootstrap-done" ]]; then
-    echo "[$(date +%H:%M:%S)] running central dedup bootstrap (one-time)" >> "$LOG_DIR/boot.log"
-    nohup bash "${HOME}/.surrogate/bin/dedup-bootstrap.sh" > "$LOG_DIR/dedup-bootstrap.log" 2>&1 &
-  fi
-
-  # ── BOOT-TIME enrich kickoff (trigger immediate pull, don't wait for cron) ──
-  # User feedback: 'since 7 AM but the data isn't growing' — cron M%60 may have
-  # been mis-aligned with rebuilds. Force one enrich run on every boot.
-  nohup bash "${HOME}/.surrogate/bin/dataset-enrich.sh" >> "$LOG_DIR/dataset-enrich.log" 2>&1 &
-  echo "[$(date +%H:%M:%S)] boot-time dataset-enrich kicked off" >> "$LOG_DIR/boot.log"
-
-  # ── BOOT-TIME kaggle-trainer kickoff (don't wait for 90-min cron) ─────
-  # Submits the LoRA training notebook to Kaggle T4 immediately on every
-  # Space boot. Kaggle CLI gracefully handles 'already running' state.
-  nohup bash "${HOME}/.surrogate/bin/kaggle-trainer.sh" >> "$LOG_DIR/kaggle-trainer.log" 2>&1 &
-  echo "[$(date +%H:%M:%S)] boot-time kaggle-trainer kicked off" >> "$LOG_DIR/boot.log"
-
-  # ── BOOT-TIME lightning-trainer kickoff — H200 4 hr free for big model ─
-  nohup bash "${HOME}/.surrogate/bin/lightning-trainer.sh" >> "$LOG_DIR/lightning-trainer.log" 2>&1 &
-  echo "[$(date +%H:%M:%S)] boot-time lightning-trainer kicked off (H200 4hr quota)" >> "$LOG_DIR/boot.log"
-
-  # ── BOOT-TIME dataset-mirror — bulk-clone top community SFT mixes ──────
-  # Far faster than streaming-and-normalize — 1 commit per parquet file
-  # = millions of pairs landing as raw mirrors/<slug>/<file>. Idempotent
-  # via stamp file so we don't re-mirror what's already been pulled.
-  nohup bash "${HOME}/.surrogate/bin/dataset-mirror.sh" >> "$LOG_DIR/dataset-mirror.log" 2>&1 &
-  echo "[$(date +%H:%M:%S)] boot-time dataset-mirror kicked off (30 community sources)" >> "$LOG_DIR/boot.log"
+  # ── Heavy harvest launchers — only on HIGH_MEM (LOW_MEM=0) ───────────
+  # On CPU-Basic (16 GB cap) launching 5 background bash + uvicorn + 5 harvest
+  # workers blew through the cap and HF auto-killed the container ~5 min after
+  # boot. These launchers are now scheduled on GCP via hermes-scheduler-daemon
+  # (entries in data/hermes-jobs.json) so harvest still runs — just not from
+  # inside the Space's RAM. Re-enable in-Space by setting LOW_MEM=0 once we
+  # upgrade to a ≥32 GB tier.
+  if [[ "$LOW_MEM" != "1" ]]; then
+    # ── One-time central dedup bootstrap from existing data ──────────
+    if [[ ! -f "${HOME}/.surrogate/.dedup-bootstrap-done" ]]; then
+      echo "[$(date +%H:%M:%S)] running central dedup bootstrap (one-time)" >> "$LOG_DIR/boot.log"
+      nohup bash "${HOME}/.surrogate/bin/dedup-bootstrap.sh" > "$LOG_DIR/dedup-bootstrap.log" 2>&1 &
+    fi
+
+    # ── BOOT-TIME enrich kickoff (trigger immediate pull, don't wait for cron)
+    nohup bash "${HOME}/.surrogate/bin/dataset-enrich.sh" >> "$LOG_DIR/dataset-enrich.log" 2>&1 &
+    echo "[$(date +%H:%M:%S)] boot-time dataset-enrich kicked off" >> "$LOG_DIR/boot.log"
+
+    # ── BOOT-TIME kaggle-trainer kickoff (don't wait for 90-min cron) ─
+    nohup bash "${HOME}/.surrogate/bin/kaggle-trainer.sh" >> "$LOG_DIR/kaggle-trainer.log" 2>&1 &
+    echo "[$(date +%H:%M:%S)] boot-time kaggle-trainer kicked off" >> "$LOG_DIR/boot.log"
+
+    # ── BOOT-TIME lightning-trainer kickoff — H200 4 hr free for big model
+    nohup bash "${HOME}/.surrogate/bin/lightning-trainer.sh" >> "$LOG_DIR/lightning-trainer.log" 2>&1 &
+    echo "[$(date +%H:%M:%S)] boot-time lightning-trainer kicked off (H200 4hr quota)" >> "$LOG_DIR/boot.log"
+
+    # ── BOOT-TIME dataset-mirror — bulk-clone top community SFT mixes ─
+    nohup bash "${HOME}/.surrogate/bin/dataset-mirror.sh" >> "$LOG_DIR/dataset-mirror.log" 2>&1 &
+    echo "[$(date +%H:%M:%S)] boot-time dataset-mirror kicked off (30 community sources)" >> "$LOG_DIR/boot.log"
+  else
+    echo "[$(date +%H:%M:%S)] LOW_MEM=1 → skipped 5 heavy harvest launchers (delegated to GCP daemons)" >> "$LOG_DIR/boot.log"
+  fi
 
   echo "[$(date +%H:%M:%S)] persistent /data linked (state, logs, memory, skills, sessions, workspace, ollama, training-pairs)" >> "$LOG_DIR/boot.log"
 else
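To confirm the gate took effect on a CPU-Basic boot, grep the run log for the
skip line emitted by the new else-branch:

  grep "LOW_MEM" "$LOG_DIR/boot.log"
  # expected on CPU-Basic:
  # [HH:MM:SS] LOW_MEM=1 → skipped 5 heavy harvest launchers (delegated to GCP daemons)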