Ashira Pitchayapakayakul committed on
Commit 7d05ef5 · 1 Parent(s): f89906d

perf(harvest): bump worker fleet 1→16 + push cadence 2× — use the headroom


Audit showed each cpu-basic 16GB Space was running only:
BULK_WORKERS=0 + STREAM_WORKERS=1 = 1 worker total
under LOW_MEM=1 (left ~8 GB unused — overly conservative after the
Round 11-12 OOM scare).
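
For the record, the per-Space numbers above came from a quick shell check along these lines; the one-liners below are illustrative, not scripts that live in the repo:

```bash
# Count live harvest workers and check actually-available memory on a Space.
pgrep -fc streaming-mirror-worker.sh   # was 1 under LOW_MEM=1
pgrep -fc bulk-mirror-worker.sh        # was 0
awk '/MemAvailable/ {printf "%.1f GB available\n", $2/1024/1024}' /proc/meminfo
```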

Bump:
BULK_WORKERS=0 → 1 (~600 MB)
STREAM_WORKERS=1 → 3 (~500 MB × 3 = 1.5 GB)
total worker fleet/Space: 1 → 4

× 4 harvest Spaces (axentx + shard1 + shard2 + shard3) = **16 workers**
in parallel (was 4). Memory budget per Space:
base ≈6 GB (OS + redis 256mb + continuous-discoverer + dataset-enrich
+ auto-startup-loop + push bursts)
workers ≈2.1 GB (3 stream + 1 bulk)
headroom ≈8 GB → comfortably above the memory-guard.sh trigger threshold (3 GB free)
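
For context, memory-guard.sh is used below as a plain exit-code gate in front of dataset-enrich.sh. A minimal sketch of that gate semantics; the threshold constant and variable names here are illustrative, and the real script in bin/v2/ also handles worker kills:

```bash
# Gate: exit 0 only when MemAvailable is above the ~3 GB trigger threshold,
# so `memory-guard.sh && dataset-enrich.sh` skips the heavy run under pressure.
MIN_FREE_KB=$((3 * 1024 * 1024))                       # 3 GB, per this commit
avail_kb=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
[[ "$avail_kb" -ge "$MIN_FREE_KB" ]]                   # exit status is the verdict
```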

Push cadence: dataset-enrich cron M%60==5 → M%30==5 (every 30 min, was
60). With the 4× worker fleet producing chunks faster, the old 60-min push
was leaving harvested data sitting on Space disk too long; 30-min cadence
drains it more eagerly without exceeding HF rate limits.
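
Both loops schedule on minute-of-hour modulo. A minimal sketch of the pattern, assuming M is refreshed from `date +%M` once per iteration (the refresh itself sits outside the diffed hunks):

```bash
# Minute-modulo scheduling: M%60==5 fires once per hour, M%30==5 twice per hour.
while true; do
  M=$((10#$(date +%M)))   # force base-10 so "08"/"09" don't parse as octal
  # old cadence:  [[ $((M % 60)) -eq 5 ]] && ... dataset-enrich.sh ... &
  # new cadence (minutes 5 and 35):
  [[ $((M % 30)) -eq 5 ]] && echo "minute $M: would launch dataset-enrich.sh" &
  sleep 60
done
```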

Throughput projection (current 70-100 commits/hr aggregate):
workers 4× + push cadence 2× = ~3-5× commit rate (HF soft-cap kicks
in past ~120/hr/repo, so realistic ceiling ≈ 250-400 commits/hr aggregate).
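
Back-of-envelope behind those numbers; the multipliers and the ~120/hr/repo soft-cap are this commit's estimates, not measurements:

```bash
cur_lo=70; cur_hi=100   # current aggregate commits/hr
echo "naive 3-5x projection: $((cur_lo * 3))-$((cur_hi * 5)) commits/hr"
echo "with the ~120/hr/repo soft-cap: realistic ceiling ~250-400/hr aggregate"
```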

Reverts the Round 11-12 over-correction now that:
(a) 4 Spaces fan-out instead of 1 monolith
(b) memory-guard.sh + auto-scaler kill workers if MemAvailable drops
(c) anchor never came up so Space pool is the only harvest tier

LOW_MEM=0 path (paid cpu-upgrade tier) also bumped: 4 stream → 6 stream
(1.5×, if a user ever upgrades a Space to 32GB).
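
Because both counts are read through ${VAR:-default} expansions (see the start.sh hunk below), any single Space can still be dialed back via its env/secrets without another commit. A hypothetical override, values for illustration only:

```bash
export BULK_WORKERS=0     # Space-level env wins over the new defaults
export STREAM_WORKERS=2
bash start.sh             # or however the Space launches its entrypoint
```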

Files changed (2)
  1. bin/anchor/cron-loop.sh +1 -1
  2. start.sh +25 -9
bin/anchor/cron-loop.sh CHANGED
@@ -108,7 +108,7 @@ while true; do
   }
   [[ $((M % 30)) -eq 15 ]] && bash "${REPO}/bin/surrogate-research-apply.sh" >>"$LOG" 2>&1 &
   [[ $((M % 360)) -eq 30 ]] && bash "${REPO}/bin/surrogate-research-loop.sh" >>"$LOG" 2>&1 &
-  [[ $((M % 60)) -eq 5 ]] && bash "${REPO}/bin/dataset-enrich.sh" >>"$LOG" 2>&1 &
+  [[ $((M % 30)) -eq 5 ]] && bash "${REPO}/bin/dataset-enrich.sh" >>"$LOG" 2>&1 &
   [[ $((M % 15)) -eq 0 ]] && bash "${REPO}/bin/surrogate-self-ingest.sh" >>"$LOG" 2>&1 &
   [[ $((M % 30)) -eq 12 ]] && bash "${REPO}/bin/rag-vector-builder.sh" >>"$LOG" 2>&1 &
   [[ $((M % 30)) -eq 7 ]] && bash "${REPO}/bin/synthetic-data-from-rework.sh" >>"$LOG" 2>&1 &
start.sh CHANGED
@@ -362,13 +362,26 @@ python3 ~/.surrogate/bin/v2/bulk-mirror-coordinator.py seed >> "$LOG_DIR/bulk-mi
   # Two worker types share the same coordinator queue:
   #   bulk-mirror-worker.sh      — full-download, suits small/medium datasets
   #   streaming-mirror-worker.sh — HF datasets streaming, suits trillion-token
-  # LOW_MEM tuned for cpu-basic 16GB after Round 9+10 OOM:
-  #   0 bulk (full-download too heavy) + 2 streaming (lighter) on LOW_MEM
-  #   Anchor handles bulk via 24GB ARM headroom.
-  # Tighter further after Round 11+12 OOM:
-  #   1 streaming worker on LOW_MEM (was 2). Anchor takes the slack via 6 workers.
- BULK_WORKERS="${BULK_WORKERS:-$([[ "$LOW_MEM" == "1" ]] && echo 0 || echo 4)}"
- STREAM_WORKERS="${STREAM_WORKERS:-$([[ "$LOW_MEM" == "1" ]] && echo 1 || echo 4)}"
+  # LOW_MEM tuning for cpu-basic 16GB Space (history):
+  #   v1: 0 bulk + 2 stream      (Round 9-10 OOM tightened to 0+2)
+  #   v2: 0 bulk + 1 stream      (Round 11-12 OOM further tightened)
+  #   v3 NOW: 1 bulk + 3 stream  (post Civo-pivot + 4-Space fan-out;
+  #                               anchor never came up so we can't rely on
+  #                               it for bulk, and 16GB has ~8 GB unused
+  #                               under the v2 setting; reclaim it)
+  #
+  # Memory budget per Space (16 GB cpu-basic):
+  #   ~6 GB reserved: OS + redis 256mb + continuous-discoverer +
+  #                   dataset-enrich + auto-startup-loop + push bursts
+  #   ~10 GB available for harvest workers
+  #   3 stream × 500 MB + 1 bulk × 600 MB = 2.1 GB used
+  #   ~8 GB headroom → memory-guard.sh kicks in at <3 GB free, safe
+  #
+  # Throughput delta: 4 workers/Space × 4 Spaces = 16 workers total
+  # (vs previous 1×4 = 4). Combined with enrich cron M%30==5 (was M%60),
+  # expect 3-5× commit rate before HF soft-cap kicks in.
+ BULK_WORKERS="${BULK_WORKERS:-$([[ "$LOW_MEM" == "1" ]] && echo 1 || echo 4)}"
+ STREAM_WORKERS="${STREAM_WORKERS:-$([[ "$LOW_MEM" == "1" ]] && echo 3 || echo 6)}"
 
  for i in $(seq 1 "$BULK_WORKERS"); do
    nohup bash ~/.surrogate/bin/v2/bulk-mirror-worker.sh "bulk-w$i" \
@@ -434,8 +447,11 @@ while true; do
   [[ $((M % 60)) -eq 4 ]] && bash ~/.surrogate/bin/scrape-keyword-tuner.sh >> "$LOG" 2>&1 &
   # Every 6 hours: research-loop (discover new features from competitors/papers)
   [[ $((M % 360)) -eq 30 ]] && bash ~/.surrogate/bin/surrogate-research-loop.sh >> "$LOG" 2>&1 &
-  # Every 60 min: dataset enrich. Memory-guarded — full HF Hub iter is heavy.
-  [[ $((M % 60)) -eq 5 ]] && bash ~/.surrogate/bin/v2/memory-guard.sh \
+  # Every 30 min: dataset enrich (was 60 min; bumped now that we have
+  # 4 Spaces × (3 stream + 1 bulk) = 16 workers harvesting in parallel,
+  # producing more chunks per hour than the old 60-min push could drain).
+  # Memory-guarded — full HF Hub iter is heavy.
+  [[ $((M % 30)) -eq 5 ]] && bash ~/.surrogate/bin/v2/memory-guard.sh \
     && bash ~/.surrogate/bin/dataset-enrich.sh >> "$LOG" 2>&1 &
   # Every 15 min: self-ingest training-pairs into FTS index (closes self-improvement)
   [[ $((M % 15)) -eq 3 ]] && bash ~/.surrogate/bin/surrogate-self-ingest.sh >> "$LOG" 2>&1 &