fix(oom-permanent): memory guard + redis cap + heavy-task gating
User: '[Thai request; text garbled in extraction]' → fix it permanently.
bin/v2/memory-guard.sh:
• Returns 0 if /proc/meminfo MemAvailable >= MIN_FREE_MB (default 3GB)
• Returns 1 if pressure → wraps cron entries to skip when memory low
• Logs skip events to ~/.surrogate/logs/memory-guard.log
• macOS vm_stat fallback for anchor (which runs locally before deploy)
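The guard's contract can be exercised in isolation. The awk parse below mirrors the script added in this commit; the fake `MemAvailable` line and the `verdict` variable are illustrative stand-ins, not part of the real script.

```shell
# Sketch of the guard's decision, run against a fake meminfo line
# (the real script reads /proc/meminfo; this value is made up).
meminfo_line='MemAvailable:    8388608 kB'
avail_mb=$(printf '%s\n' "$meminfo_line" | awk '/^MemAvailable:/{print int($2/1024)}')
min_free_mb="${MIN_FREE_MB:-3072}"
if [ "$avail_mb" -ge "$min_free_mb" ]; then
  verdict="proceed"   # real guard: exit 0
else
  verdict="skip"      # real guard: log + exit 1
fi
echo "$verdict"
```

8388608 kB is 8192 MB, well above the 3072 MB default, so this sample proceeds.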
start.sh changes:
• redis maxmemory: 1gb → 256mb on LOW_MEM (coordinator uses SQLite anyway;
  redis is only a soft work-queue cache). Saves ~750 MB.
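The cap selection is a one-liner in start.sh: a parameter-expansion default (`${VAR:-...}`) wrapped around a command substitution, so an explicit `REDIS_MAX` env always wins. A standalone sketch of the idiom, with `LOW_MEM` hard-coded for demonstration:

```shell
LOW_MEM=1        # normally set by start.sh's own low-memory detection
unset REDIS_MAX  # simulate: no operator override present
# Explicit REDIS_MAX wins; otherwise pick the cap from LOW_MEM.
REDIS_MAX="${REDIS_MAX:-$([ "$LOW_MEM" = "1" ] && echo "256mb" || echo "1gb")}"
echo "$REDIS_MAX"
```

With `LOW_MEM=1` and no override this prints `256mb`; exporting `REDIS_MAX=512mb` beforehand would bypass the LOW_MEM logic entirely.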
• Heavy cron entries now gated by memory-guard:
  - push-training-to-hf (loads 866K-pair shards = 500MB-1GB peak)
  - dataset-enrich (HF Hub iter)
  - synthetic-data-from-rework (LLM ladder calls)
  - self-improve-loop (problem gen + judge)
  - enrich-pipeline (re-process bulk-mirror outputs)
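The gating pattern relies on `&&` short-circuiting: the heavy task only launches when both the minute-offset test and the guard succeed. A sketch with stand-in functions (the real cron lines call the actual scripts and background the whole chain):

```shell
# Stand-ins for memory-guard.sh and a heavy task; && short-circuits
# exactly as in the start.sh cron lines.
M=7                    # minute counter, as maintained by start.sh's loop
guard() { return 0; }  # pretend memory is fine
ran=""
heavy() { ran="yes"; }
[ $((M % 30)) -eq 7 ] && guard && heavy
echo "${ran:-no}"
```

In start.sh the chain ends with `&`, which backgrounds the entire `test && guard && task` chain, so even the guard check runs off the main scheduler loop.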
Net effect on cpu-basic:
• Stagger fix (3234167) handles concurrency timing
• Memory guard handles concurrency MAGNITUDE → even if 5 things fire
  same minute, the heavy ones skip if RAM tight
• Redis cap saves 750 MB → permanent headroom
• Total memory floor ~2.5 GB (was 3.5 GB) → 13.5 GB headroom for cron bursts
Anchor (when up) takes the heavy work via 24 GB ARM. Space stays light.
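The headroom bullet is plain arithmetic; a ~16 GB total for cpu-basic is an assumption here (the commit only states the floor and the resulting headroom):

```shell
# Arithmetic behind the headroom bullet. total_mb assumes cpu-basic
# has ~16 GB RAM, which the commit message implies but never states.
total_mb=$(( 16 * 1024 ))
floor_mb=2560                       # ~2.5 GB new steady-state floor
headroom_mb=$(( total_mb - floor_mb ))
echo "$headroom_mb"
```

13824 MB is the ~13.5 GB of burst headroom the bullet claims.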
- bin/v2/memory-guard.sh +35 -0
- start.sh +17 -8
bin/v2/memory-guard.sh
@@ -0,0 +1,35 @@
+#!/usr/bin/env bash
+# Surrogate-1 v2 - memory-pressure guard for cron entries.
+#
+# Usage in cron line:
+#   bash memory-guard.sh && bash heavy-task.sh
+#   # → heavy-task only runs if free memory >= MIN_FREE_MB
+#
+# Default threshold 3 GB (3072 MB). Adjust via $MIN_FREE_MB env.
+# Returns 0 (proceed) if enough free memory; 1 (skip) if pressure.
+set -uo pipefail
+
+MIN_FREE_MB="${MIN_FREE_MB:-3072}"
+
+# Linux container (HF Space) /proc/meminfo
+if [[ -r /proc/meminfo ]]; then
+  AVAIL=$(awk '/^MemAvailable:/{print int($2/1024)}' /proc/meminfo)
+elif command -v vm_stat >/dev/null 2>&1; then
+  # macOS fallback (used by anchor cron-loop running locally on dev)
+  PAGES_FREE=$(vm_stat | awk '/Pages free/{gsub("\\.","",$3); print $3}')
+  PAGE_SIZE=$(vm_stat | awk '/page size of/{print $8}')
+  AVAIL=$(( PAGES_FREE * PAGE_SIZE / 1048576 ))
+else
+  # Unknown → assume OK
+  AVAIL=99999
+fi
+
+if (( AVAIL >= MIN_FREE_MB )); then
+  exit 0  # proceed
+fi
+
+# Pressure → log + skip
+LOG="${MEMORY_GUARD_LOG:-${HOME}/.surrogate/logs/memory-guard.log}"
+mkdir -p "$(dirname "$LOG")" 2>/dev/null || true
+echo "[$(date '+%H:%M:%S')] SKIP - avail=${AVAIL}MB < ${MIN_FREE_MB}MB threshold" >> "$LOG" 2>/dev/null || true
+exit 1
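The skip branch is easy to exercise without real memory pressure by pointing the same awk parse at a temp file. The file contents below are fabricated for the example:

```shell
# Same MemAvailable parse as memory-guard.sh, aimed at a temp file so
# the skip branch can be demonstrated anywhere (no real /proc needed).
fake=$(mktemp)
printf 'MemTotal:       16331056 kB\nMemAvailable:    2097152 kB\n' > "$fake"
AVAIL=$(awk '/^MemAvailable:/{print int($2/1024)}' "$fake")
rm -f "$fake"
echo "$AVAIL"
```

2097152 kB parses to 2048 MB, below the 3072 MB default, so the guard would log a SKIP line and exit 1.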
start.sh
@@ -192,8 +192,11 @@ if [[ -n "$GH_TOKEN" ]]; then
 fi
 
 # ── 4. Redis (TCP only) ─────────────────────────────────────────────────────
+# redis cap tightened on LOW_MEM (was 1gb → 256mb). Coordinator uses
+# SQLite directly; redis is only a soft cache for work-queue priorities.
+REDIS_MAX="${REDIS_MAX:-$([[ "$LOW_MEM" == "1" ]] && echo "256mb" || echo "1gb")}"
 redis-server --daemonize yes --port 6379 --bind 127.0.0.1 \
-    --maxmemory
+    --maxmemory "$REDIS_MAX" --maxmemory-policy allkeys-lru
 sleep 1
 redis-cli -h 127.0.0.1 -p 6379 ping >> "$LOG_DIR/redis.log" 2>&1
 
@@ -394,7 +397,10 @@ while true; do
 # Each major task picks a unique M%X==N offset so no two fire together.
 [[ $((M % 2)) -eq 1 ]] && bash ~/.surrogate/bin/surrogate-dev-loop.sh 1 >> "$LOG" 2>&1 &
 [[ $((M % 5)) -eq 2 ]] && bash ~/.surrogate/bin/work-queue-producer.sh >> "$LOG" 2>&1 &
-
+# push-training-to-hf gated by memory (loads big shard into RAM).
+# Anchor (24GB) takes over when capacity arrives; see anchor cron-loop.
+[[ $((M % 3)) -eq 1 ]] && bash ~/.surrogate/bin/v2/memory-guard.sh \
+  && bash ~/.surrogate/bin/push-training-to-hf.sh >> "$LOG" 2>&1 &
 # auto-orchestrate now runs CONTINUOUSLY (4 parallel workers); see step 7e below.
 # Cron entry retained for legacy single-fire boost (no harm if continuous already up):
 [[ $((M % 20)) -eq 0 ]] && pgrep -f "auto-orchestrate-continuous" >/dev/null || bash ~/.surrogate/bin/auto-orchestrate-loop.sh >> "$LOG" 2>&1 &
@@ -404,15 +410,16 @@ while true; do
 [[ $((M % 60)) -eq 4 ]] && bash ~/.surrogate/bin/scrape-keyword-tuner.sh >> "$LOG" 2>&1 &
 # Every 6 hours: research-loop (discover new features from competitors/papers)
 [[ $((M % 360)) -eq 30 ]] && bash ~/.surrogate/bin/surrogate-research-loop.sh >> "$LOG" 2>&1 &
-# Every 60 min: dataset enrich
-
-
+# Every 60 min: dataset enrich. Memory-guarded; full HF Hub iter is heavy.
+[[ $((M % 60)) -eq 5 ]] && bash ~/.surrogate/bin/v2/memory-guard.sh \
+  && bash ~/.surrogate/bin/dataset-enrich.sh >> "$LOG" 2>&1 &
 # Every 15 min: self-ingest training-pairs into FTS index (closes self-improvement)
 [[ $((M % 15)) -eq 3 ]] && bash ~/.surrogate/bin/surrogate-self-ingest.sh >> "$LOG" 2>&1 &
 # Every 30 min: build vector embeddings index (RAG semantic search)
 [[ $((M % 30)) -eq 12 ]] && bash ~/.surrogate/bin/rag-vector-builder.sh >> "$LOG" 2>&1 &
 # Every 30 min: synthetic data generation (REWORK→APPROVE DPO + distilabel rewrite)
-[[ $((M % 30)) -eq 7 ]] && bash ~/.surrogate/bin/
+[[ $((M % 30)) -eq 7 ]] && bash ~/.surrogate/bin/v2/memory-guard.sh \
+  && bash ~/.surrogate/bin/synthetic-data-from-rework.sh >> "$LOG" 2>&1 &
 # Daily 04:00 UTC: refresh CVE feed (NVD + CISA KEV) → security-knowledge dataset
 [[ $((M % 1440)) -eq 240 ]] && bash ~/.surrogate/bin/refresh-cve-feed.sh >> "$LOG" 2>&1 &
 # Daily 05:00 UTC: scrape SRE postmortems (danluu list + awesome-tech-postmortems)
@@ -434,7 +441,8 @@ while true; do
 # ── Round 5 (2026-04) sustainability loops ──────────────────────────
 # Every 6 hr (offset 90): self-improve loop → gen problems, judge,
 # winners → training data, losers → reflexion-store.
-[[ $((M % 360)) -eq 90 ]] && bash ~/.surrogate/bin/v2/
+[[ $((M % 360)) -eq 90 ]] && bash ~/.surrogate/bin/v2/memory-guard.sh \
+  && bash ~/.surrogate/bin/v2/self-improve-loop.sh >> "$LOG_DIR/self-improve.log" 2>&1 &
 # Every 30 min (offset 22): mine new tool-call traces from logs into
 # SFT + DPO data, plus voyager skill candidates.
 [[ $((M % 30)) -eq 22 ]] && python3 ~/.surrogate/bin/v2/tool-trace-collector.py >> "$LOG_DIR/tool-trace.log" 2>&1 &
@@ -463,7 +471,8 @@ while true; do
 [[ $((M % 30)) -eq 9 ]] && bash ~/.surrogate/bin/v2/aggressive-harvester.sh \
     >> "$LOG_DIR/aggressive-harvester.log" 2>&1 &
 # Every 60 min (offset 35): enrich newly-mirrored bulk files
-[[ $((M % 60)) -eq 35 ]] && bash ~/.surrogate/bin/v2/
+[[ $((M % 60)) -eq 35 ]] && bash ~/.surrogate/bin/v2/memory-guard.sh \
+  && bash ~/.surrogate/bin/v2/enrich-pipeline.sh \
     >> "$LOG_DIR/enrich-pipeline.log" 2>&1 &
 # Every 30 min (offset 25): spawn extra streaming worker if pool empty
 [[ $((M % 30)) -eq 25 ]] && {
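For completeness, the complementary case to the gated cron lines above: when the guard reports pressure, `&&` short-circuits and the heavy task never starts. A minimal sketch with stand-in functions (not the real scripts):

```shell
# Skip path: a failing guard short-circuits the chain, so the heavy
# task never launches.
guard() { return 1; }                  # memory-guard.sh under pressure
heavy() { echo "should not print"; }
result="skipped"
if guard && heavy; then result="ran"; fi
echo "$result"
```

The `if` wrapper keeps the failing chain from tripping `set -e` in callers; the bare `guard && heavy &` form in start.sh is safe there because backgrounded jobs do not propagate failure to the loop.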