Ashira Pitchayapakayakul committed
Commit d8d7a71 · 1 Parent(s): 3234167

fix(oom-permanent): memory guard + redis cap + heavy-task gating


User: 'Then fix it so it doesn't die. Do I have to spell it out?' — fix it permanently.

bin/v2/memory-guard.sh:
• Returns 0 if /proc/meminfo MemAvailable >= MIN_FREE_MB (default 3 GB)
• Returns 1 under pressure → cron entries wrapped with it skip while memory is low
• Logs each skip event to ~/.surrogate/logs/memory-guard.log
• macOS vm_stat fallback for the anchor (which runs the guard locally before deploy)
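
The cron gating depends only on the guard's exit status. A stub (illustrative name, not the real script) shows the `&&` contract the wrapped entries rely on:

```shell
# Stub standing in for memory-guard.sh: succeed (exit 0) only when the
# faked "available MB" clears the threshold, mirroring the real contract.
guard() { [ "$1" -ge "${MIN_FREE_MB:-3072}" ]; }

# Same shape as the gated cron lines: heavy task fires only if guard passes.
guard 4096 && echo "heavy task runs"     # 4096 MB free >= 3072 MB default
guard 1024 || echo "heavy task skipped"  # 1024 MB free -> guard exits 1
```

Because the guard exits nonzero on pressure, a skipped run costs one process spawn and a log line, nothing more.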

start.sh changes:
• redis maxmemory: 1gb → 256mb on LOW_MEM (the coordinator uses SQLite anyway;
  redis is only a soft work-queue cache). Saves ~750 MB.
• Heavy cron entries now gated by memory-guard:
  - push-training-to-hf (loads 866K-pair shards; 500 MB-1 GB peak)
  - dataset-enrich (full HF Hub iteration)
  - synthetic-data-from-rework (LLM ladder calls)
  - self-improve-loop (problem generation + judging)
  - enrich-pipeline (re-processes bulk-mirror outputs)
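
The LOW_MEM-conditional redis cap reduces to a one-line selection. A sketch of that logic (function name is illustrative), using the 1gb/256mb values from the diff:

```shell
# Pick the redis maxmemory value the same way start.sh does:
# 256mb when LOW_MEM=1, else the old 1gb default.
pick_redis_max() {
  if [ "${1:-0}" = "1" ]; then echo "256mb"; else echo "1gb"; fi
}

pick_redis_max 1   # -> 256mb
pick_redis_max 0   # -> 1gb
```

The saving is 1024 MB - 256 MB = 768 MB, the "~750 MB" cited above.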

Net effect on cpu-basic:
• Stagger fix (3234167) handles concurrency timing
• Memory guard handles concurrency MAGNITUDE — even if 5 tasks fire in the
  same minute, the heavy ones skip when RAM is tight
• Redis cap saves ~750 MB → permanent headroom
• Total memory floor ~2.5 GB (was ~3.5 GB) → ~13.5 GB headroom for cron bursts
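
The headroom figure follows from the floor, assuming a 16 GB cpu-basic box (the 16384 MB total below is that assumption, not something the diff sets):

```shell
TOTAL_MB=16384                          # assumed cpu-basic RAM (16 GB)
FLOOR_MB=2560                           # new ~2.5 GB steady-state floor
HEADROOM_MB=$(( TOTAL_MB - FLOOR_MB ))  # room left for cron bursts
echo "headroom: ${HEADROOM_MB} MB"      # 13824 MB = 13.5 GB
```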

Anchor (when up) takes the heavy work on its 24 GB ARM box; the Space stays light.

Files changed (2):
  1. bin/v2/memory-guard.sh +35 -0
  2. start.sh +17 -8

bin/v2/memory-guard.sh ADDED
@@ -0,0 +1,35 @@
+#!/usr/bin/env bash
+# Surrogate-1 v2 — memory-pressure guard for cron entries.
+#
+# Usage in cron line:
+#   bash memory-guard.sh && bash heavy-task.sh
+#   # ↑ heavy-task only runs if free memory >= MIN_FREE_MB
+#
+# Default threshold 3 GB (3072 MB). Adjust via $MIN_FREE_MB env.
+# Returns 0 (proceed) if enough free memory; 1 (skip) if pressure.
+set -uo pipefail
+
+MIN_FREE_MB="${MIN_FREE_MB:-3072}"
+
+# Linux container (HF Space): /proc/meminfo
+if [[ -r /proc/meminfo ]]; then
+    AVAIL=$(awk '/^MemAvailable:/{print int($2/1024)}' /proc/meminfo)
+elif command -v vm_stat >/dev/null 2>&1; then
+    # macOS fallback (used by anchor cron-loop running locally on dev)
+    PAGES_FREE=$(vm_stat | awk '/Pages free/{gsub("\\.","",$3); print $3}')
+    PAGE_SIZE=$(vm_stat | awk '/page size of/{print $8}')
+    AVAIL=$(( PAGES_FREE * PAGE_SIZE / 1048576 ))
+else
+    # Unknown — assume OK
+    AVAIL=99999
+fi
+
+if (( AVAIL >= MIN_FREE_MB )); then
+    exit 0  # proceed
+fi
+
+# Pressure — log + skip
+LOG="${MEMORY_GUARD_LOG:-${HOME}/.surrogate/logs/memory-guard.log}"
+mkdir -p "$(dirname "$LOG")" 2>/dev/null || true
+echo "[$(date '+%H:%M:%S')] SKIP — avail=${AVAIL}MB < ${MIN_FREE_MB}MB threshold" >> "$LOG" 2>&1 || true
+exit 1
start.sh CHANGED
@@ -192,8 +192,11 @@ if [[ -n "$GH_TOKEN" ]]; then
 fi
 
 # ── 4. Redis (TCP only) ─────────────────────────────────────────────────────
+# redis cap tightened on LOW_MEM (was 1gb → 256mb). Coordinator uses
+# SQLite directly; redis is only a soft cache for work-queue priorities.
+REDIS_MAX="${REDIS_MAX:-$([[ "$LOW_MEM" == "1" ]] && echo "256mb" || echo "1gb")}"
 redis-server --daemonize yes --port 6379 --bind 127.0.0.1 \
-    --maxmemory 1gb --maxmemory-policy allkeys-lru
+    --maxmemory "$REDIS_MAX" --maxmemory-policy allkeys-lru
 sleep 1
 redis-cli -h 127.0.0.1 -p 6379 ping >> "$LOG_DIR/redis.log" 2>&1
 
@@ -394,7 +397,10 @@ while true; do
 # Each major task picks a unique M%X==N offset so no two fire together.
 [[ $((M % 2)) -eq 1 ]] && bash ~/.surrogate/bin/surrogate-dev-loop.sh 1 >> "$LOG" 2>&1 &
 [[ $((M % 5)) -eq 2 ]] && bash ~/.surrogate/bin/work-queue-producer.sh >> "$LOG" 2>&1 &
-[[ $((M % 3)) -eq 1 ]] && bash ~/.surrogate/bin/push-training-to-hf.sh >> "$LOG" 2>&1 &
+# push-training-to-hf gated by memory (loads big shard into RAM).
+# Anchor (24GB) takes over when capacity arrives — see anchor cron-loop.
+[[ $((M % 3)) -eq 1 ]] && bash ~/.surrogate/bin/v2/memory-guard.sh \
+    && bash ~/.surrogate/bin/push-training-to-hf.sh >> "$LOG" 2>&1 &
 # auto-orchestrate now runs CONTINUOUSLY (4 parallel workers) — see step 7e below.
 # Cron entry retained for legacy single-fire boost (no harm if continuous already up):
 [[ $((M % 20)) -eq 0 ]] && pgrep -f "auto-orchestrate-continuous" >/dev/null || bash ~/.surrogate/bin/auto-orchestrate-loop.sh >> "$LOG" 2>&1 &
@@ -404,15 +410,16 @@ while true; do
 [[ $((M % 60)) -eq 4 ]] && bash ~/.surrogate/bin/scrape-keyword-tuner.sh >> "$LOG" 2>&1 &
 # Every 6 hours: research-loop (discover new features from competitors/papers)
 [[ $((M % 360)) -eq 30 ]] && bash ~/.surrogate/bin/surrogate-research-loop.sh >> "$LOG" 2>&1 &
-# Every 60 min: dataset enrich (pulls fresh public datasets, dedups, uploads to HF)
-# (was 4h — accelerated to drain 96-dataset queue ASAP per user request)
-[[ $((M % 60)) -eq 5 ]] && bash ~/.surrogate/bin/dataset-enrich.sh >> "$LOG" 2>&1 &
+# Every 60 min: dataset enrich. Memory-guarded — full HF Hub iter is heavy.
+[[ $((M % 60)) -eq 5 ]] && bash ~/.surrogate/bin/v2/memory-guard.sh \
+    && bash ~/.surrogate/bin/dataset-enrich.sh >> "$LOG" 2>&1 &
 # Every 15 min: self-ingest training-pairs into FTS index (closes self-improvement)
 [[ $((M % 15)) -eq 3 ]] && bash ~/.surrogate/bin/surrogate-self-ingest.sh >> "$LOG" 2>&1 &
 # Every 30 min: build vector embeddings index (RAG semantic search)
 [[ $((M % 30)) -eq 12 ]] && bash ~/.surrogate/bin/rag-vector-builder.sh >> "$LOG" 2>&1 &
 # Every 30 min: synthetic data generation (REWORK→APPROVE DPO + distilabel rewrite)
-[[ $((M % 30)) -eq 7 ]] && bash ~/.surrogate/bin/synthetic-data-from-rework.sh >> "$LOG" 2>&1 &
+[[ $((M % 30)) -eq 7 ]] && bash ~/.surrogate/bin/v2/memory-guard.sh \
+    && bash ~/.surrogate/bin/synthetic-data-from-rework.sh >> "$LOG" 2>&1 &
 # Daily 04:00 UTC: refresh CVE feed (NVD + CISA KEV) → security-knowledge dataset
 [[ $((M % 1440)) -eq 240 ]] && bash ~/.surrogate/bin/refresh-cve-feed.sh >> "$LOG" 2>&1 &
 # Daily 05:00 UTC: scrape SRE postmortems (danluu list + awesome-tech-postmortems)
@@ -434,7 +441,8 @@ while true; do
 # ── Round 5 (2026-04) sustainability loops ──────────────────────────
 # Every 6 hr (offset 90): self-improve loop — gen problems, judge,
 # winners → training data, losers → reflexion-store.
-[[ $((M % 360)) -eq 90 ]] && bash ~/.surrogate/bin/v2/self-improve-loop.sh >> "$LOG_DIR/self-improve.log" 2>&1 &
+[[ $((M % 360)) -eq 90 ]] && bash ~/.surrogate/bin/v2/memory-guard.sh \
+    && bash ~/.surrogate/bin/v2/self-improve-loop.sh >> "$LOG_DIR/self-improve.log" 2>&1 &
 # Every 30 min (offset 22): mine new tool-call traces from logs into
 # SFT + DPO data, plus voyager skill candidates.
 [[ $((M % 30)) -eq 22 ]] && python3 ~/.surrogate/bin/v2/tool-trace-collector.py >> "$LOG_DIR/tool-trace.log" 2>&1 &
@@ -463,7 +471,8 @@ while true; do
 [[ $((M % 30)) -eq 9 ]] && bash ~/.surrogate/bin/v2/aggressive-harvester.sh \
     >> "$LOG_DIR/aggressive-harvester.log" 2>&1 &
 # Every 60 min (offset 35): enrich newly-mirrored bulk files
-[[ $((M % 60)) -eq 35 ]] && bash ~/.surrogate/bin/v2/enrich-pipeline.sh \
+[[ $((M % 60)) -eq 35 ]] && bash ~/.surrogate/bin/v2/memory-guard.sh \
+    && bash ~/.surrogate/bin/v2/enrich-pipeline.sh \
     >> "$LOG_DIR/enrich-pipeline.log" 2>&1 &
 # Every 30 min (offset 25): spawn extra streaming worker if pool empty
 [[ $((M % 30)) -eq 25 ]] && {