Ashira Pitchayapakayakul committed on
Commit 859b78d · 1 Parent(s): 99cff37

feat: self-heal-watchdog + drop shards 6→4 to dodge OOM kill


Monitor caught a second 'Memory limit exceeded (16Gi)' even with shards
already reduced to 6. Three changes to break the OOM-restart cycle:

1. NEW: bin/self-heal-watchdog.sh — runs on the Space, ticks every 60s,
   and acts on three failure modes that previously needed manual fixes:
   * mem >= 85% -> SIGTERM the youngest dataset-enrich shard. The shard
     loop's cooldown then respawns it (see the sketch after the
     bulk-ingest-parallel.sh hunk below), so memory recovers without
     letting the kernel OOM-kill the entire container.
   * /data/training-pairs.jsonl idle >= 20m AND no recent 'kept:' line
     -> tear down + respawn bulk-ingest-parallel.
   * push-training-to-hf log idle >= 10m while the file grew -> kick the
     cron once.
   Started in start.sh BEFORE the memory-hungry workers so it's ready
   to act as soon as pressure builds.

2. NUM_SHARDS 6 -> 4 and PARQUET_PARALLEL 6 -> 2. Empirically, the
   'datasets' library peaks at 1.5-2 GB per shard during parquet decode
   under load, so 6 simultaneous shards still breached the 16 GB ceiling.
   4 shards x ~2 GB peak + 2 parquet DLs x ~0.5 GB + 30 daemons + Python
   heap leaves ~3 GB headroom for the watchdog to react before OOM
   (rough budget sketched below the list).

3. Status server allowlist: 'self-heal-watchdog' added so the watchdog
   log is queryable via /logs/self-heal-watchdog (ticks + heal events).
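
The memory budget implied by (2), spelled out. The per-shard and per-DL
peaks are the figures quoted above; the daemons-plus-heap figure is only
what the ~3 GB headroom claim implies, not a measurement from this commit:

    4 ingest shards  x ~2.0 GB peak  ≈  8 GB
    2 parquet DLs    x ~0.5 GB peak  ≈  1 GB
    ~30 daemons + Python heap        ≈  4 GB  (implied, not measured)
                                     --------
                                     ≈ 13 GB of the 16 GB cap -> ~3 GB headroom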

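As a quick check of (3), the new log should be reachable over HTTP once
the Space is up. A hedged example: the host below is a placeholder for the
Space's public URL, and the 'lines' query parameter is an assumption based
on the log_tail(name, lines=100) signature in hermes-status-server.py:

    curl "https://<space-host>/logs/self-heal-watchdog"            # last 100 lines (handler default)
    curl "https://<space-host>/logs/self-heal-watchdog?lines=20"   # assuming 'lines' is a query param
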
bin/bulk-ingest-parallel.sh CHANGED
@@ -10,11 +10,15 @@ set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a
 LOG="$HOME/.surrogate/logs/bulk-ingest-parallel.log"
 mkdir -p "$(dirname "$LOG")"

-NUM_SHARDS="${INGEST_SHARDS:-6}"   # was 16 caused Memory limit exceeded (16Gi)
-                                   # on cpu-basic. Each shard streams
-                                   # ~1 GB peak via 'datasets' lib.
-                                   # 6 shards x ~1 GB + 30 daemons +
-                                   # Python heap fits comfortably under 16 GB.
+NUM_SHARDS="${INGEST_SHARDS:-4}"   # was 16 -> 6 -> 4. cpu-basic 16Gi
+                                   # cap was breached even with 6
+                                   # shards because 'datasets' lib
+                                   # peaks ~1.5-2 GB during parquet
+                                   # decode under load. 4 shards +
+                                   # parquet-direct (2 DLs) + 30
+                                   # daemons fits comfortably with
+                                   # ~3 GB headroom for the watchdog
+                                   # to react before OOM.
 SHARD_COOLDOWN="${SHARD_COOLDOWN:-120}"   # 2 min between shard cycles

 echo "[$(date +%H:%M:%S)] bulk-ingest-parallel start (shards=$NUM_SHARDS)" | tee -a "$LOG"
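
The commit message notes that a SIGTERM'd shard is respawned by the shard
loop's cooldown; that loop is not part of this hunk. A hypothetical sketch
of the pattern, assuming each shard runs under its own retry loop —
ingest-shard.sh and its flags are placeholder names, not taken from the
repository:

    # Hypothetical per-shard supervisor: when the watchdog SIGTERMs a shard,
    # the inner loop sees the exit, sleeps SHARD_COOLDOWN, then respawns it,
    # so memory drains before work resumes.
    for i in $(seq 0 $((NUM_SHARDS - 1))); do
      (
        while true; do
          bash "$HOME/.surrogate/bin/ingest-shard.sh" --shard "$i" --of "$NUM_SHARDS" \
            >> "$LOG" 2>&1 || true
          sleep "$SHARD_COOLDOWN"
        done
      ) &
    done
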
bin/hermes-status-server.py CHANGED
@@ -166,7 +166,7 @@ def log_tail(name: str, lines: int = 100) -> PlainTextResponse:
         "auto-orchestrate-loop", "training-push", "ollama", "discord-bot",
         "hermes-discord-bot", "surrogate-research-loop", "surrogate-research-apply",
         "surrogate-dev-loop", "domain-scrape-loop", "github-domain-scrape",
-        "qwen-coder", "git-clone", "git-pull", "redis", "parquet-direct-ingest", "bulk-ingest-parallel", "rag-vector-builder", "auto-orchestrate-continuous", "dataset-enrich", "hf-dataset-discoverer", "dedup-bootstrap", "github-agentic-crawler", "ollama-pull-granite", "synthetic-data", "self-ingest", "scrape-sre-postmortems", "refresh-cve-feed",
+        "qwen-coder", "git-clone", "git-pull", "redis", "parquet-direct-ingest", "bulk-ingest-parallel", "rag-vector-builder", "auto-orchestrate-continuous", "dataset-enrich", "hf-dataset-discoverer", "dedup-bootstrap", "github-agentic-crawler", "ollama-pull-granite", "synthetic-data", "self-ingest", "scrape-sre-postmortems", "refresh-cve-feed", "self-heal-watchdog",
         "ollama-pull-coder", "ollama-pull-devstral", "ollama-pull-fallback",
         "ollama-pull-yicoder", "ollama-pull-embed", "ollama-pull-light",
     }
bin/self-heal-watchdog.sh ADDED
@@ -0,0 +1,118 @@
+#!/usr/bin/env bash
+# Self-healing watchdog — runs on HF Space, not Mac.
+#
+# Monitors three failure modes that previously needed manual intervention:
+#   1. Memory pressure (>85% of 16Gi cap) — preempt OOM by killing the
+#      youngest dataset-enrich shard. Earlier shards are preferred to live
+#      because they're closer to completing their iteration.
+#   2. Stuck ingestion — if /data/training-pairs.jsonl hasn't grown in 20 min
+#      AND bulk-ingest log hasn't logged "kept:" in 15 min, force-restart
+#      bulk-ingest-parallel.
+#   3. Stale upload backlog — if push-training-to-hf.sh hasn't logged anything
+#      in 10 min while file has grown, kick the cron once.
+#
+# Conservative: never kills more than one process per cycle. Always logs the
+# decision and the trigger so post-mortem is easy.
+
+set -uo pipefail
+LOG="$HOME/.surrogate/logs/self-heal-watchdog.log"
+mkdir -p "$(dirname "$LOG")"
+
+MEM_THRESHOLD_PCT="${MEM_THRESHOLD_PCT:-85}"
+STUCK_INGEST_MIN="${STUCK_INGEST_MIN:-20}"
+STUCK_PUSH_MIN="${STUCK_PUSH_MIN:-10}"
+TICK_SEC="${TICK_SEC:-60}"
+
+log() { echo "[$(date +%Y-%m-%d\ %H:%M:%S)] $*" | tee -a "$LOG"; }
+
+mem_pct() {
+  # Container's RAM as %. /proc/meminfo on Linux only.
+  awk '/^MemTotal:/ {tot=$2} /^MemAvailable:/ {avail=$2}
+       END {if (tot>0) printf "%d", (tot-avail)*100/tot; else print 0}' /proc/meminfo 2>/dev/null
+}
+
+mtime_age_min() {
+  [[ -f "$1" ]] || { echo 99999; return; }
+  local mt now
+  mt=$(stat -c %Y "$1" 2>/dev/null || echo 0)
+  now=$(date +%s)
+  echo $(( (now - mt) / 60 ))
+}
+
+last_kept_age_min() {
+  local log_path="$HOME/.surrogate/logs/bulk-ingest-parallel.log"
+  [[ -f "$log_path" ]] || { echo 99999; return; }
+  # Find the most recent "kept:" line and compute its age via the log mtime
+  # as a coarse proxy. Precise per-line timestamping is a future refinement.
+  if grep -q "kept:" "$log_path" 2>/dev/null; then
+    mtime_age_min "$log_path"
+  else
+    echo 99999
+  fi
+}
+
+heal_memory() {
+  local pct="$1"
+  log "MEMORY ALERT pct=$pct% threshold=$MEM_THRESHOLD_PCT% — preempting OOM"
+  # Find the youngest (highest PID) dataset-enrich shard process and SIGTERM.
+  # The shard loop sleeps SHARD_COOLDOWN before respawning, so memory recovers.
+  local victim
+  victim=$(pgrep -f "dataset-enrich.sh" | sort -nr | head -1)
+  if [[ -n "$victim" ]]; then
+    log "  -> kill youngest dataset-enrich pid=$victim"
+    kill -TERM "$victim" 2>/dev/null || true
+  else
+    log "  -> no dataset-enrich processes found; nothing to preempt"
+  fi
+}
+
+heal_stuck_ingest() {
+  log "STUCK INGEST — pairs file age >${STUCK_INGEST_MIN}m AND no recent kept: lines"
+  # Force-restart the parallel-ingest manager. Existing shard children get
+  # reaped; the next supervisor cycle re-spawns them with fresh state.
+  local victims
+  victims=$(pgrep -f "bulk-ingest-parallel.sh")
+  if [[ -n "$victims" ]]; then
+    log "  -> SIGTERM bulk-ingest-parallel pids: $victims"
+    echo "$victims" | xargs -r kill -TERM 2>/dev/null || true
+    sleep 5
+    nohup bash "$HOME/.surrogate/bin/bulk-ingest-parallel.sh" \
+      >> "$HOME/.surrogate/logs/bulk-ingest-parallel.log" 2>&1 &
+    log "  -> respawned bulk-ingest-parallel pid=$!"
+  else
+    log "  -> bulk-ingest-parallel not running; spawning fresh"
+    nohup bash "$HOME/.surrogate/bin/bulk-ingest-parallel.sh" \
+      >> "$HOME/.surrogate/logs/bulk-ingest-parallel.log" 2>&1 &
+  fi
+}
+
+heal_stale_push() {
+  log "STALE PUSH — uploader log idle >${STUCK_PUSH_MIN}m while file grew"
+  nohup bash "$HOME/.surrogate/bin/push-training-to-hf.sh" \
+    >> "$HOME/.surrogate/logs/training-push.log" 2>&1 &
+  log "  -> kicked push-training-to-hf pid=$!"
+}
+
+log "watchdog start — mem_threshold=${MEM_THRESHOLD_PCT}% stuck_ingest=${STUCK_INGEST_MIN}m stuck_push=${STUCK_PUSH_MIN}m tick=${TICK_SEC}s"
+
+while true; do
+  pct=$(mem_pct)
+  pairs_age=$(mtime_age_min "$HOME/.surrogate/training-pairs.jsonl")
+  kept_age=$(last_kept_age_min)
+  push_age=$(mtime_age_min "$HOME/.surrogate/logs/training-push.log")
+
+  # Memory healing has highest priority — OOM kills the whole container.
+  if [[ "$pct" -ge "$MEM_THRESHOLD_PCT" ]]; then
+    heal_memory "$pct"
+  # Then stuck ingestion (no new pairs AND no fresh kept: lines)
+  elif [[ "$pairs_age" -ge "$STUCK_INGEST_MIN" ]] && [[ "$kept_age" -ge 15 ]]; then
+    heal_stuck_ingest
+  # Then stale uploader (new pairs queued but uploader hasn't run)
+  elif [[ "$pairs_age" -lt 5 ]] && [[ "$push_age" -ge "$STUCK_PUSH_MIN" ]]; then
+    heal_stale_push
+  fi
+
+  # One-line heartbeat per tick — easy to grep for "things are fine"
+  log "tick mem=${pct}% pairs_age=${pairs_age}m kept_age=${kept_age}m push_age=${push_age}m"
+  sleep "$TICK_SEC"
+done
start.sh CHANGED
@@ -255,16 +255,26 @@ echo "[$(date +%H:%M:%S)] hf-dataset-discoverer started (continuous mega-mix hun
 nohup bash ~/.surrogate/bin/auto-orchestrate-continuous.sh > "$LOG_DIR/auto-orchestrate-continuous.log" 2>&1 &
 echo "[$(date +%H:%M:%S)] auto-orchestrate-continuous started (4 parallel workers, never sleeps)" >> "$LOG_DIR/boot.log"

-# ── 7f. PARALLEL BULK INGEST (16 shards by slug-hash, drain 293M cap) ───────
+# ── 7e1. SELF-HEAL WATCHDOG must start BEFORE memory-hungry workers ───────
+# Monitors RAM usage every 60s; preempts youngest dataset-enrich shard if
+# usage >= 85% to dodge the cpu-basic 16Gi OOM kill that would otherwise
+# crash the entire container. Also restarts stuck ingest / kicks stale uploader.
+nohup bash ~/.surrogate/bin/self-heal-watchdog.sh > "$LOG_DIR/self-heal-watchdog.log" 2>&1 &
+echo "[$(date +%H:%M:%S)] self-heal-watchdog started (mem<85%, ingest<20m, push<10m)" >> "$LOG_DIR/boot.log"
+
+# ── 7f. PARALLEL BULK INGEST (slug-hash sharded; 4 shards on cpu-basic) ─────
+# Was 16 shards but caused 'Memory limit exceeded (16Gi)' OOM. Each shard
+# peaks ~1.5-2 GB while streaming via 'datasets' lib. Watchdog above provides
+# a second safety net if peak still spikes.
 nohup bash ~/.surrogate/bin/bulk-ingest-parallel.sh > "$LOG_DIR/bulk-ingest-parallel.log" 2>&1 &
-echo "[$(date +%H:%M:%S)] bulk-ingest-parallel started (16 shards, 293M total cap)" >> "$LOG_DIR/boot.log"
+echo "[$(date +%H:%M:%S)] bulk-ingest-parallel started (4 shards, 293M total cap)" >> "$LOG_DIR/boot.log"

 # ── 7g. PARQUET-DIRECT INGEST (skip 'datasets' library overhead, 5-10× faster) ──
 # Downloads parquet shards directly via HF datasets-server API + pyarrow filter.
 # Targets only trillion-scale corpora where streaming is too slow.
-# 6 parallel downloadscoordinated with bulk-ingest via central dedup store.
-nohup bash ~/.surrogate/bin/parquet-direct-ingest.sh > "$LOG_DIR/parquet-direct-ingest.log" 2>&1 &
-echo "[$(date +%H:%M:%S)] parquet-direct-ingest started (6 parallel DLs)" >> "$LOG_DIR/boot.log"
+# DLs reduced to 2 parallel; combined with 4 ingest shards this stays under 16Gi.
+PARQUET_PARALLEL=2 nohup bash ~/.surrogate/bin/parquet-direct-ingest.sh > "$LOG_DIR/parquet-direct-ingest.log" 2>&1 &
+echo "[$(date +%H:%M:%S)] parquet-direct-ingest started (2 parallel DLs)" >> "$LOG_DIR/boot.log"

 # ── 7c. Skill-synthesis daemon (extract patterns from cloned repos → skills) ─
 nohup bash ~/.surrogate/bin/skill-synthesis-daemon.sh > "$LOG_DIR/skill-synthesis.log" 2>&1 &