Commit 99cff37, parent e179514, committed by Ashira Pitchayapakayakul

fix: OOM kill on cpu-basic — drop 16 shards to 6 + slower stagger


The HF runtime API revealed the actual cause of the RUNTIME_ERROR cycles:
  errorMessage: 'Memory limit exceeded (16Gi)'
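
For reference, a hedged sketch of how that runtime status can be pulled from the Hub API (the owner/space path below is a placeholder, not the actual repo; 'stage' and 'errorMessage' are the fields quoted above):

    # Query the Space's runtime status; errorMessage shows up when the Space
    # cycles into RUNTIME_ERROR. <owner>/<space> is a placeholder.
    curl -s -H "Authorization: Bearer $HF_TOKEN" \
      "https://huggingface.co/api/spaces/<owner>/<space>/runtime" \
      | python3 -c 'import json,sys; r = json.load(sys.stdin); print(r.get("stage"), r.get("errorMessage"))'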

Each shard streams datasets via the 'datasets' library, peaking around
1 GB while a parquet shard downloads and decodes. Sixteen of those, plus
30 co-resident daemons, plus the Python heap, simply does not fit under
16 GB.
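
A rough back-of-envelope check of that budget (the per-daemon and heap figures below are illustrative guesses, not measurements from this Space):

    # Memory-budget arithmetic against the 16 GiB cpu-basic limit.
    SHARD_PEAK_MB=1024   # ~1 GB per-shard peak from the estimate above
    NUM_DAEMONS=30
    DAEMON_MB=150        # assumed average RSS per co-resident daemon
    HEAP_MB=1024         # assumed shared Python/interpreter overhead

    for shards in 16 6; do
      total_mb=$(( shards * SHARD_PEAK_MB + NUM_DAEMONS * DAEMON_MB + HEAP_MB ))
      echo "shards=$shards -> ~$(( total_mb / 1024 )) GiB peak"
    done
    # shards=16 -> ~21 GiB peak  (over the 16 GiB limit)
    # shards=6  -> ~11 GiB peak  (leaves headroom)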

Changes:
NUM_SHARDS: 16 -> 6 (~6 GB peak from shards, leaves headroom)
Stagger: 15s -> 30s (gradual ramp; lets early shards settle into
steady-state before later peers spool up)
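
Both defaults stay overridable per run through the existing environment variables (the values here are just examples, not recommended settings):

    # One-off run with a different shard count and cooldown:
    INGEST_SHARDS=4 SHARD_COOLDOWN=300 bash bin/bulk-ingest-parallel.sh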

Trade-off: throughput vs stability. 6 shards still process ~24 datasets
in parallel (each shard owns ~1/6 of the 146-dataset list). With the
filename-collision fix from the last commit (every iter uploads a unique
file), cumulative growth across iterations replaces what raw shard count
gave us.
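
For illustration only: one way a shard could own its ~1/6 slice is index-modulo selection. The real selection logic lives in shard_loop, outside this diff, so the function and DATASETS array below are hypothetical.

    # Hypothetical sketch: each shard takes every NUM_SHARDS-th entry, so
    # 6 shards split a 146-dataset list into slices of ~24.
    shard_slice() {
      local shard_id="$1" num_shards="$2" i
      for i in "${!DATASETS[@]}"; do     # DATASETS: assumed array of 146 repo ids
        if (( i % num_shards == shard_id )); then
          echo "${DATASETS[$i]}"
        fi
      done
    }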

Files changed (1)
  1. bin/bulk-ingest-parallel.sh +10 -4
bin/bulk-ingest-parallel.sh CHANGED
@@ -10,8 +10,12 @@ set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a
 LOG="$HOME/.surrogate/logs/bulk-ingest-parallel.log"
 mkdir -p "$(dirname "$LOG")"
 
-NUM_SHARDS="${INGEST_SHARDS:-16}"
-SHARD_COOLDOWN="${SHARD_COOLDOWN:-120}"  # 2 min between shard cycles (was 3)
+NUM_SHARDS="${INGEST_SHARDS:-6}"         # was 16 — caused Memory limit exceeded (16Gi)
+                                         # on cpu-basic. Each shard streams
+                                         # ~1 GB peak via 'datasets' lib.
+                                         # 6 shards x ~1 GB + 30 daemons +
+                                         # Python heap fits comfortably under 16 GB.
+SHARD_COOLDOWN="${SHARD_COOLDOWN:-120}"  # 2 min between shard cycles
 
 echo "[$(date +%H:%M:%S)] bulk-ingest-parallel start (shards=$NUM_SHARDS)" | tee -a "$LOG"
 
@@ -28,9 +32,11 @@ shard_loop() {
   done
 }
 
-# Stagger startup 15s apart (was 30s) to spin up faster
+# Stagger startup 30s apart so memory ramps up gradually — if the OOM killer
+# is going to fire, give earlier shards a chance to settle into steady-state
+# before all peers are loading datasets in parallel.
 for i in $(seq 0 $((NUM_SHARDS - 1))); do
   shard_loop "$i" "$NUM_SHARDS" &
-  sleep 15
+  sleep 30
 done
 wait