Ashira Pitchayapakayakul committed on
Commit 7d05ef5 · 1 Parent(s): f89906d

perf(harvest): bump worker fleet 1→16 + push cadence 2× — use the headroom


Audit showed each cpu-basic 16GB Space was running only:
BULK_WORKERS=0 + STREAM_WORKERS=1 = 1 worker total
under LOW_MEM=1 (left ~8 GB unused — overly conservative after the
Round 11-12 OOM scare).
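
For the record, the per-Space numbers above came from a quick shell check along these lines; the one-liners below are illustrative, not scripts that live in the repo:

```bash
# Count live harvest workers and check actually-available memory on a Space.
pgrep -fc streaming-mirror-worker.sh   # was 1 under LOW_MEM=1
pgrep -fc bulk-mirror-worker.sh        # was 0
awk '/MemAvailable/ {printf "%.1f GB available\n", $2/1024/1024}' /proc/meminfo
```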

Bump:
BULK_WORKERS=0 → 1 (~600 MB)
STREAM_WORKERS=1 → 3 (~500 MB × 3 = 1.5 GB)
total worker fleet/Space: 1 → 4

× 4 harvest Spaces (axentx + shard1 + shard2 + shard3) = **16 workers**
in parallel (was 4). Memory budget per Space:
base ≈6 GB (OS + redis 256mb + continuous-discoverer + dataset-enrich
+ auto-startup-loop + push bursts)
workers ≈2.1 GB (3 stream + 1 bulk)
headroom ≈8 GB → comfortably above the memory-guard.sh trigger threshold (3 GB free)
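
For context, memory-guard.sh is used below as a plain exit-code gate in front of dataset-enrich.sh. A minimal sketch of that gate semantics; the threshold constant and variable names here are illustrative, and the real script in bin/v2/ also handles worker kills:

```bash
# Gate: exit 0 only when MemAvailable is above the ~3 GB trigger threshold,
# so `memory-guard.sh && dataset-enrich.sh` skips the heavy run under pressure.
MIN_FREE_KB=$((3 * 1024 * 1024))                       # 3 GB, per this commit
avail_kb=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
[[ "$avail_kb" -ge "$MIN_FREE_KB" ]]                   # exit status is the verdict
```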

Push cadence: dataset-enrich cron M%60==5 → M%30==5 (every 30 min, was
60). With the 4× worker fleet producing chunks faster, the old 60-min push
was leaving harvested data sitting on Space disk too long; 30-min cadence
drains it more eagerly without exceeding HF rate limits.
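
Both loops schedule on minute-of-hour modulo. A minimal sketch of the pattern, assuming M is refreshed from `date +%M` once per iteration (the refresh itself sits outside the diffed hunks):

```bash
# Minute-modulo scheduling: M%60==5 fires once per hour, M%30==5 twice per hour.
while true; do
  M=$((10#$(date +%M)))   # force base-10 so "08"/"09" don't parse as octal
  # old cadence:  [[ $((M % 60)) -eq 5 ]] && ... dataset-enrich.sh ... &
  # new cadence (minutes 5 and 35):
  [[ $((M % 30)) -eq 5 ]] && echo "minute $M: would launch dataset-enrich.sh" &
  sleep 60
done
```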

Throughput projection (current 70-100 commits/hr aggregate):
workers 4× + push cadence 2× = ~3-5× commit rate (HF soft-cap kicks
in past ~120/hr/repo, so realistic ceiling ≈ 250-400 commits/hr aggregate).
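
Back-of-envelope behind those numbers; the multipliers and the ~120/hr/repo soft-cap are this commit's estimates, not measurements:

```bash
cur_lo=70; cur_hi=100   # current aggregate commits/hr
echo "naive 3-5x projection: $((cur_lo * 3))-$((cur_hi * 5)) commits/hr"
echo "with the ~120/hr/repo soft-cap: realistic ceiling ~250-400/hr aggregate"
```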

Reverts the Round 11-12 over-correction now that:
(a) 4 Spaces fan-out instead of 1 monolith
(b) memory-guard.sh + auto-scaler kill workers if MemAvailable drops
(c) anchor never came up so Space pool is the only harvest tier

LOW_MEM=0 path (paid cpu-upgrade tier) also bumped: 4 stream → 6 stream
(1.5×, if a user ever upgrades a Space to 32GB).
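
Because both counts are read through ${VAR:-default} expansions (see the start.sh hunk below), any single Space can still be dialed back via its env/secrets without another commit. A hypothetical override, values for illustration only:

```bash
export BULK_WORKERS=0     # Space-level env wins over the new defaults
export STREAM_WORKERS=2
bash start.sh             # or however the Space launches its entrypoint
```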

Files changed (2)
  1. bin/anchor/cron-loop.sh +1 -1
  2. start.sh +25 -9
bin/anchor/cron-loop.sh CHANGED
@@ -108,7 +108,7 @@ while true; do
   }
   [[ $((M % 30)) -eq 15 ]] && bash "${REPO}/bin/surrogate-research-apply.sh" >>"$LOG" 2>&1 &
   [[ $((M % 360)) -eq 30 ]] && bash "${REPO}/bin/surrogate-research-loop.sh" >>"$LOG" 2>&1 &
-  [[ $((M % 60)) -eq 5 ]] && bash "${REPO}/bin/dataset-enrich.sh" >>"$LOG" 2>&1 &
+  [[ $((M % 30)) -eq 5 ]] && bash "${REPO}/bin/dataset-enrich.sh" >>"$LOG" 2>&1 &
   [[ $((M % 15)) -eq 0 ]] && bash "${REPO}/bin/surrogate-self-ingest.sh" >>"$LOG" 2>&1 &
   [[ $((M % 30)) -eq 12 ]] && bash "${REPO}/bin/rag-vector-builder.sh" >>"$LOG" 2>&1 &
   [[ $((M % 30)) -eq 7 ]] && bash "${REPO}/bin/synthetic-data-from-rework.sh" >>"$LOG" 2>&1 &
start.sh CHANGED
@@ -362,13 +362,26 @@ python3 ~/.surrogate/bin/v2/bulk-mirror-coordinator.py seed >> "$LOG_DIR/bulk-mi
   # Two worker types share the same coordinator queue:
   #   bulk-mirror-worker.sh      — full-download, suits small/medium datasets
   #   streaming-mirror-worker.sh — HF datasets streaming, suits trillion-token
-  # LOW_MEM tuned for cpu-basic 16GB after Round 9+10 OOM:
-  #   0 bulk (full-download too heavy) + 2 streaming (lighter) on LOW_MEM
-  #   Anchor handles bulk via 24GB ARM headroom.
-  # Tighter further after Round 11+12 OOM:
-  #   1 streaming worker on LOW_MEM (was 2). Anchor takes the slack via 6 workers.
- BULK_WORKERS="${BULK_WORKERS:-$([[ "$LOW_MEM" == "1" ]] && echo 0 || echo 4)}"
- STREAM_WORKERS="${STREAM_WORKERS:-$([[ "$LOW_MEM" == "1" ]] && echo 1 || echo 4)}"
+  # LOW_MEM tuning for cpu-basic 16GB Space (history):
+  #   v1: 0 bulk + 2 stream      (Round 9-10 OOM tightened to 0+2)
+  #   v2: 0 bulk + 1 stream      (Round 11-12 OOM further tightened)
+  #   v3 NOW: 1 bulk + 3 stream  (post Civo-pivot + 4-Space fan-out;
+  #                               anchor never came up so we can't rely on
+  #                               it for bulk, and 16GB has ~8 GB unused
+  #                               under the v2 setting; reclaim it)
+  #
+  # Memory budget per Space (16 GB cpu-basic):
+  #   ~6 GB reserved: OS + redis 256mb + continuous-discoverer +
+  #                   dataset-enrich + auto-startup-loop + push bursts
+  #   ~10 GB available for harvest workers
+  #   3 stream × 500 MB + 1 bulk × 600 MB = 2.1 GB used
+  #   ~8 GB headroom → memory-guard.sh kicks in at <3 GB free, safe
+  #
+  # Throughput delta: 4 workers/Space × 4 Spaces = 16 workers total
+  # (vs previous 1×4 = 4). Combined with enrich cron M%30==5 (was M%60),
+  # expect 3-5× commit rate before HF soft-cap kicks in.
+ BULK_WORKERS="${BULK_WORKERS:-$([[ "$LOW_MEM" == "1" ]] && echo 1 || echo 4)}"
+ STREAM_WORKERS="${STREAM_WORKERS:-$([[ "$LOW_MEM" == "1" ]] && echo 3 || echo 6)}"
 
  for i in $(seq 1 "$BULK_WORKERS"); do
    nohup bash ~/.surrogate/bin/v2/bulk-mirror-worker.sh "bulk-w$i" \
@@ -434,8 +447,11 @@ while true; do
   [[ $((M % 60)) -eq 4 ]] && bash ~/.surrogate/bin/scrape-keyword-tuner.sh >> "$LOG" 2>&1 &
   # Every 6 hours: research-loop (discover new features from competitors/papers)
   [[ $((M % 360)) -eq 30 ]] && bash ~/.surrogate/bin/surrogate-research-loop.sh >> "$LOG" 2>&1 &
-  # Every 60 min: dataset enrich. Memory-guarded — full HF Hub iter is heavy.
-  [[ $((M % 60)) -eq 5 ]] && bash ~/.surrogate/bin/v2/memory-guard.sh \
+  # Every 30 min: dataset enrich (was 60 min; bumped now that we have
+  # 4 Spaces × (3 stream + 1 bulk) = 16 workers harvesting in parallel,
+  # producing more chunks per hour than the old 60-min push could drain).
+  # Memory-guarded — full HF Hub iter is heavy.
+  [[ $((M % 30)) -eq 5 ]] && bash ~/.surrogate/bin/v2/memory-guard.sh \
     && bash ~/.surrogate/bin/dataset-enrich.sh >> "$LOG" 2>&1 &
   # Every 15 min: self-ingest training-pairs into FTS index (closes self-improvement)
   [[ $((M % 15)) -eq 3 ]] && bash ~/.surrogate/bin/surrogate-self-ingest.sh >> "$LOG" 2>&1 &