fix: OOM kill on cpu-basic — drop 16 shards to 6 + slower stagger
The HF runtime API revealed the actual cause of the RUNTIME_ERROR cycles:
errorMessage: 'Memory limit exceeded (16Gi)'
Each shard streams datasets via the 'datasets' library, peaking around
1 GB while a parquet shard downloads + decodes. 16 of those, plus 30
co-resident daemons, plus the Python heap, simply does not fit in 16 GB.
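The arithmetic can be sanity-checked with a back-of-envelope budget. The ~1 GB/shard figure comes from this message; the per-daemon size is a rough assumption for illustration, not a measurement:

```shell
SHARD_PEAK_MB=1024        # ~1 GB peak per shard while a parquet shard decodes
DAEMON_MB=100             # assumed average per co-resident daemon (illustrative)
LIMIT_MB=$((16 * 1024))   # cpu-basic hard limit: 16Gi

for shards in 16 6; do
  used=$((shards * SHARD_PEAK_MB + 30 * DAEMON_MB))
  echo "shards=$shards -> ~${used}MB of ${LIMIT_MB}MB before Python heap"
done
# shards=16 -> ~19384MB of 16384MB before Python heap  (over the limit)
# shards=6  -> ~9144MB  of 16384MB before Python heap  (comfortable headroom)
```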
Changes:
- NUM_SHARDS: 16 -> 6 (~6 GB peak from shards, leaves headroom)
- Stagger: 15s -> 30s (gradual ramp; lets early shards settle into
  steady-state before later peers spool up)
- SHARD_COOLDOWN: 120s between shard cycles
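With a 30s stagger the ramp to full parallelism is easy to compute (a sketch mirroring the new defaults):

```shell
NUM_SHARDS=6
STAGGER=30   # seconds between shard launches

for i in $(seq 0 $((NUM_SHARDS - 1))); do
  echo "shard $i starts at t=$(( i * STAGGER ))s"
done
# the last shard comes up at t=150s, so peak memory is reached ~2.5 min
# after launch instead of all shards downloading at once
```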
Trade-off: throughput vs stability. 6 shards still cover the full
146-dataset list (each shard owns ~24 datasets, ~1/6 of it, with 6
running in parallel). With the filename-collision fix from the last
commit (every iter uploads a unique file), cumulative growth across
iterations replaces what raw shard count gave us.
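How each shard ends up with ~1/6 of the list: a minimal sketch, assuming a round-robin (index mod NUM_SHARDS) split; the script's actual partitioning scheme may differ:

```shell
NUM_SHARDS=6
TOTAL=146   # datasets on the list

for shard in $(seq 0 $((NUM_SHARDS - 1))); do
  count=0
  for i in $(seq 0 $((TOTAL - 1))); do
    # dataset i belongs to shard (i mod NUM_SHARDS)
    if [ $((i % NUM_SHARDS)) -eq "$shard" ]; then
      count=$((count + 1))
    fi
  done
  echo "shard $shard owns $count datasets"
done
# shards 0 and 1 own 25 datasets each; shards 2-5 own 24 (146 = 6*24 + 2)
```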
- bin/bulk-ingest-parallel.sh +10 -4

@@ -10,8 +10,12 @@ set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a
 LOG="$HOME/.surrogate/logs/bulk-ingest-parallel.log"
 mkdir -p "$(dirname "$LOG")"
 
-NUM_SHARDS="${INGEST_SHARDS:-16}"
-
+NUM_SHARDS="${INGEST_SHARDS:-6}"          # was 16 — caused Memory limit exceeded (16Gi)
+                                          # on cpu-basic. Each shard streams
+                                          # ~1 GB peak via 'datasets' lib.
+                                          # 6 shards x ~1 GB + 30 daemons +
+                                          # Python heap fits comfortably under 16 GB.
+SHARD_COOLDOWN="${SHARD_COOLDOWN:-120}"   # 2 min between shard cycles
 
 echo "[$(date +%H:%M:%S)] bulk-ingest-parallel start (shards=$NUM_SHARDS)" | tee -a "$LOG"
 
@@ -28,9 +32,11 @@ shard_loop() {
   done
 }
 
-# Stagger startup
+# Stagger startup 30s apart so memory ramps up gradually — if the OOM killer
+# is going to fire, give earlier shards a chance to settle into steady-state
+# before all peers are loading datasets in parallel.
 for i in $(seq 0 $((NUM_SHARDS - 1))); do
   shard_loop "$i" "$NUM_SHARDS" &
-  sleep 15
+  sleep 30
 done
 wait
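If another OOM shows up, a tiny snapshot helper makes headroom visible in the logs. This is an add-on sketch, not part of the committed script; it assumes cgroup v2 with a v1 fallback:

```shell
mem_snapshot() {
  # container memory usage in bytes: cgroup v2 first, then v1,
  # else "?" when neither interface is mounted
  if [ -r /sys/fs/cgroup/memory.current ]; then
    cat /sys/fs/cgroup/memory.current
  elif [ -r /sys/fs/cgroup/memory/memory.usage_in_bytes ]; then
    cat /sys/fs/cgroup/memory/memory.usage_in_bytes
  else
    echo "?"
  fi
}

echo "[$(date +%H:%M:%S)] mem.current=$(mem_snapshot)"
```

Calling it once per shard cycle would leave a usage trace right next to the existing log lines.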