Commit 99cff37, parent e179514, committed by Ashira Pitchayapakayakul

fix: OOM kill on cpu-basic — drop 16 shards to 6 + slower stagger


The HF runtime API revealed the actual cause of the RUNTIME_ERROR cycles:
  errorMessage: 'Memory limit exceeded (16Gi)'
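
For reference, a hedged sketch of how that runtime status can be pulled from the Hub API (the owner/space path below is a placeholder, not the actual repo; 'stage' and 'errorMessage' are the fields quoted above):

    # Query the Space's runtime status; errorMessage shows up when the Space
    # cycles into RUNTIME_ERROR. <owner>/<space> is a placeholder.
    curl -s -H "Authorization: Bearer $HF_TOKEN" \
      "https://huggingface.co/api/spaces/<owner>/<space>/runtime" \
      | python3 -c 'import json,sys; r = json.load(sys.stdin); print(r.get("stage"), r.get("errorMessage"))'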

Each shard streams datasets via the 'datasets' library, peaking around
1 GB while a parquet shard downloads and decodes. Sixteen of those, plus
30 co-resident daemons, plus the Python heap, simply does not fit under
16 GB.
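
A rough back-of-envelope check of that budget (the per-daemon and heap figures below are illustrative guesses, not measurements from this Space):

    # Memory-budget arithmetic against the 16 GiB cpu-basic limit.
    SHARD_PEAK_MB=1024   # ~1 GB per-shard peak from the estimate above
    NUM_DAEMONS=30
    DAEMON_MB=150        # assumed average RSS per co-resident daemon
    HEAP_MB=1024         # assumed shared Python/interpreter overhead

    for shards in 16 6; do
      total_mb=$(( shards * SHARD_PEAK_MB + NUM_DAEMONS * DAEMON_MB + HEAP_MB ))
      echo "shards=$shards -> ~$(( total_mb / 1024 )) GiB peak"
    done
    # shards=16 -> ~21 GiB peak  (over the 16 GiB limit)
    # shards=6  -> ~11 GiB peak  (leaves headroom)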

Changes:
NUM_SHARDS: 16 -> 6 (~6 GB peak from shards, leaves headroom)
Stagger: 15s -> 30s (gradual ramp; lets early shards settle into
steady-state before later peers spool up)
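
Both defaults stay overridable per run through the existing environment variables (the values here are just examples, not recommended settings):

    # One-off run with a different shard count and cooldown:
    INGEST_SHARDS=4 SHARD_COOLDOWN=300 bash bin/bulk-ingest-parallel.sh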

Trade-off: throughput vs stability. 6 shards still process ~24 datasets
in parallel (each shard owns ~1/6 of the 146-dataset list). With the
filename-collision fix from the last commit (every iter uploads a unique
file), cumulative growth across iterations replaces what raw shard count
gave us.
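
For illustration only: one way a shard could own its ~1/6 slice is index-modulo selection. The real selection logic lives in shard_loop, outside this diff, so the function and DATASETS array below are hypothetical.

    # Hypothetical sketch: each shard takes every NUM_SHARDS-th entry, so
    # 6 shards split a 146-dataset list into slices of ~24.
    shard_slice() {
      local shard_id="$1" num_shards="$2" i
      for i in "${!DATASETS[@]}"; do     # DATASETS: assumed array of 146 repo ids
        if (( i % num_shards == shard_id )); then
          echo "${DATASETS[$i]}"
        fi
      done
    }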

Files changed (1)
  1. bin/bulk-ingest-parallel.sh +10 -4
bin/bulk-ingest-parallel.sh CHANGED
@@ -10,8 +10,12 @@ set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a
 LOG="$HOME/.surrogate/logs/bulk-ingest-parallel.log"
 mkdir -p "$(dirname "$LOG")"
 
-NUM_SHARDS="${INGEST_SHARDS:-16}"
-SHARD_COOLDOWN="${SHARD_COOLDOWN:-120}"  # 2 min between shard cycles (was 3)
+NUM_SHARDS="${INGEST_SHARDS:-6}"         # was 16 — caused Memory limit exceeded (16Gi)
+                                         # on cpu-basic. Each shard streams
+                                         # ~1 GB peak via 'datasets' lib.
+                                         # 6 shards x ~1 GB + 30 daemons +
+                                         # Python heap fits comfortably under 16 GB.
+SHARD_COOLDOWN="${SHARD_COOLDOWN:-120}"  # 2 min between shard cycles
 
 echo "[$(date +%H:%M:%S)] bulk-ingest-parallel start (shards=$NUM_SHARDS)" | tee -a "$LOG"
 
@@ -28,9 +32,11 @@ shard_loop() {
   done
 }
 
-# Stagger startup 15s apart (was 30s) to spin up faster
+# Stagger startup 30s apart so memory ramps up gradually — if the OOM killer
+# is going to fire, give earlier shards a chance to settle into steady-state
+# before all peers are loading datasets in parallel.
 for i in $(seq 0 $((NUM_SHARDS - 1))); do
   shard_loop "$i" "$NUM_SHARDS" &
-  sleep 15
+  sleep 30
 done
 wait