feat: self-heal-watchdog + drop shards 6→4 to dodge OOM kill
Monitor caught a second 'Memory limit exceeded (16Gi)' even with shards
already reduced to 6. Two changes to break the OOM-restart cycle, plus one
observability tweak:
1. NEW: bin/self-heal-watchdog.sh — runs on the Space, ticks every 60s,
and acts on three failure modes that previously needed manual fixes:
* mem >= 85% -> SIGTERM the youngest dataset-enrich shard. Shard
loop's cooldown then respawns it, so memory recovers without
letting the kernel OOM-kill the entire container.
* /data/training-pairs.jsonl idle >= 20m AND no recent 'kept:' line
-> tear down + respawn bulk-ingest-parallel.
* push-training-to-hf log idle >= 10m while file grew -> kick the
cron once.
Started in start.sh BEFORE the memory-hungry workers so it's ready
to act as soon as pressure builds.
2. NUM_SHARDS 6 -> 4 and PARQUET_PARALLEL 6 -> 2. Empirically the
'datasets' library peaks 1.5-2 GB during parquet decode under load,
so 6 simultaneous shards still breached the 16 GB ceiling. 4 shards
x ~2 GB peak + 2 parquet DLs x ~0.5 GB + 30 daemons + Python heap
leaves ~3 GB headroom for the watchdog to react before OOM.
3. Status server allowlist: 'self-heal-watchdog' added so the watchdog
log is queryable via /logs/self-heal-watchdog (ticks + heal events).
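The 85% trigger in change 1 is just a percentage derived from MemTotal and
MemAvailable. A minimal sketch of that check, parameterized over the meminfo
path so it can be exercised off the Space (the `mem_pct_from` helper name and
/tmp paths are illustrative; the watchdog itself reads /proc/meminfo):

```shell
#!/usr/bin/env bash
# Sketch of the watchdog's memory check against a meminfo-style file.
mem_pct_from() {
  awk '/^MemTotal:/ {tot=$2} /^MemAvailable:/ {avail=$2}
       END {if (tot>0) printf "%d", (tot-avail)*100/tot; else print 0}' "$1" 2>/dev/null
}

# Simulate a cpu-basic 16Gi box with ~2.4 GiB still available.
printf 'MemTotal:       16777216 kB\nMemAvailable:    2516582 kB\n' > /tmp/fake-meminfo
pct=$(mem_pct_from /tmp/fake-meminfo)
echo "$pct"   # 85 -> exactly at the SIGTERM threshold
[[ "$pct" -ge "${MEM_THRESHOLD_PCT:-85}" ]] && echo "would preempt a shard"
```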
- bin/bulk-ingest-parallel.sh +9 -5
- bin/hermes-status-server.py +1 -1
- bin/self-heal-watchdog.sh +118 -0
- start.sh +15 -5
bin/bulk-ingest-parallel.sh
@@ -10,11 +10,15 @@ set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a
 LOG="$HOME/.surrogate/logs/bulk-ingest-parallel.log"
 mkdir -p "$(dirname "$LOG")"
 
-NUM_SHARDS="${INGEST_SHARDS:-
-#
-#
-#
-#
+NUM_SHARDS="${INGEST_SHARDS:-4}" # was 16 -> 6 -> 4. cpu-basic 16Gi
+# cap was breached even with 6
+# shards because 'datasets' lib
+# peaks ~1.5-2 GB during parquet
+# decode under load. 4 shards +
+# parquet-direct (2 DLs) + 30
+# daemons fits comfortably with
+# ~3 GB headroom for the watchdog
+# to react before OOM.
 SHARD_COOLDOWN="${SHARD_COOLDOWN:-120}" # 2 min between shard cycles
 
 echo "[$(date +%H:%M:%S)] bulk-ingest-parallel start (shards=$NUM_SHARDS)" | tee -a "$LOG"
bin/hermes-status-server.py
@@ -166,7 +166,7 @@ def log_tail(name: str, lines: int = 100) -> PlainTextResponse:
 "auto-orchestrate-loop", "training-push", "ollama", "discord-bot",
 "hermes-discord-bot", "surrogate-research-loop", "surrogate-research-apply",
 "surrogate-dev-loop", "domain-scrape-loop", "github-domain-scrape",
-"qwen-coder", "git-clone", "git-pull", "redis", "parquet-direct-ingest", "bulk-ingest-parallel", "rag-vector-builder", "auto-orchestrate-continuous", "dataset-enrich", "hf-dataset-discoverer", "dedup-bootstrap", "github-agentic-crawler", "ollama-pull-granite", "synthetic-data", "self-ingest", "scrape-sre-postmortems", "refresh-cve-feed",
+"qwen-coder", "git-clone", "git-pull", "redis", "parquet-direct-ingest", "bulk-ingest-parallel", "rag-vector-builder", "auto-orchestrate-continuous", "dataset-enrich", "hf-dataset-discoverer", "dedup-bootstrap", "github-agentic-crawler", "ollama-pull-granite", "synthetic-data", "self-ingest", "scrape-sre-postmortems", "refresh-cve-feed", "self-heal-watchdog",
 "ollama-pull-coder", "ollama-pull-devstral", "ollama-pull-fallback",
 "ollama-pull-yicoder", "ollama-pull-embed", "ollama-pull-light",
 }
bin/self-heal-watchdog.sh (new file)
@@ -0,0 +1,118 @@
+#!/usr/bin/env bash
+# Self-healing watchdog — runs on HF Space, not Mac.
+#
+# Monitors three failure modes that previously needed manual intervention:
+#   1. Memory pressure (>85% of 16Gi cap) — preempt OOM by killing the
+#      youngest dataset-enrich shard. Earlier shards are preferred to live
+#      because they're closer to completing their iteration.
+#   2. Stuck ingestion — if /data/training-pairs.jsonl hasn't grown in 20 min
+#      AND bulk-ingest log hasn't logged "kept:" in 15 min, force-restart
+#      bulk-ingest-parallel.
+#   3. Stale upload backlog — if push-training-to-hf.sh hasn't logged anything
+#      in 10 min while file has grown, kick the cron once.
+#
+# Conservative: never kills more than one process per cycle. Always logs the
+# decision and the trigger so post-mortem is easy.
+
+set -uo pipefail
+LOG="$HOME/.surrogate/logs/self-heal-watchdog.log"
+mkdir -p "$(dirname "$LOG")"
+
+MEM_THRESHOLD_PCT="${MEM_THRESHOLD_PCT:-85}"
+STUCK_INGEST_MIN="${STUCK_INGEST_MIN:-20}"
+STUCK_PUSH_MIN="${STUCK_PUSH_MIN:-10}"
+TICK_SEC="${TICK_SEC:-60}"
+
+log() { echo "[$(date +%Y-%m-%d\ %H:%M:%S)] $*" | tee -a "$LOG"; }
+
+mem_pct() {
+  # Container's RAM as %. /proc/meminfo on Linux only.
+  awk '/^MemTotal:/ {tot=$2} /^MemAvailable:/ {avail=$2}
+       END {if (tot>0) printf "%d", (tot-avail)*100/tot; else print 0}' /proc/meminfo 2>/dev/null
+}
+
+mtime_age_min() {
+  [[ -f "$1" ]] || { echo 99999; return; }
+  local mt now
+  mt=$(stat -c %Y "$1" 2>/dev/null || echo 0)
+  now=$(date +%s)
+  echo $(( (now - mt) / 60 ))
+}
+
+last_kept_age_min() {
+  local log_path="$HOME/.surrogate/logs/bulk-ingest-parallel.log"
+  [[ -f "$log_path" ]] || { echo 99999; return; }
+  # Find the most recent "kept:" line and compute its age via the log mtime
+  # as a coarse proxy. Precise per-line timestamping is a future refinement.
+  if grep -q "kept:" "$log_path" 2>/dev/null; then
+    mtime_age_min "$log_path"
+  else
+    echo 99999
+  fi
+}
+
+heal_memory() {
+  local pct="$1"
+  log "MEMORY ALERT pct=$pct% threshold=$MEM_THRESHOLD_PCT% — preempting OOM"
+  # Find the youngest (highest PID) dataset-enrich shard process and SIGTERM.
+  # The shard loop sleeps SHARD_COOLDOWN before respawning, so memory recovers.
+  local victim
+  victim=$(pgrep -f "dataset-enrich.sh" | sort -nr | head -1)
+  if [[ -n "$victim" ]]; then
+    log "  -> kill youngest dataset-enrich pid=$victim"
+    kill -TERM "$victim" 2>/dev/null || true
+  else
+    log "  -> no dataset-enrich processes found; nothing to preempt"
+  fi
+}
+
+heal_stuck_ingest() {
+  log "STUCK INGEST — pairs file age >${STUCK_INGEST_MIN}m AND no recent kept: lines"
+  # Force-restart the parallel-ingest manager. Existing shard children get
+  # reaped; the next supervisor cycle re-spawns them with fresh state.
+  local victims
+  victims=$(pgrep -f "bulk-ingest-parallel.sh")
+  if [[ -n "$victims" ]]; then
+    log "  -> SIGTERM bulk-ingest-parallel pids: $victims"
+    echo "$victims" | xargs -r kill -TERM 2>/dev/null || true
+    sleep 5
+    nohup bash "$HOME/.surrogate/bin/bulk-ingest-parallel.sh" \
+      >> "$HOME/.surrogate/logs/bulk-ingest-parallel.log" 2>&1 &
+    log "  -> respawned bulk-ingest-parallel pid=$!"
+  else
+    log "  -> bulk-ingest-parallel not running; spawning fresh"
+    nohup bash "$HOME/.surrogate/bin/bulk-ingest-parallel.sh" \
+      >> "$HOME/.surrogate/logs/bulk-ingest-parallel.log" 2>&1 &
+  fi
+}
+
+heal_stale_push() {
+  log "STALE PUSH — uploader log idle >${STUCK_PUSH_MIN}m while file grew"
+  nohup bash "$HOME/.surrogate/bin/push-training-to-hf.sh" \
+    >> "$HOME/.surrogate/logs/training-push.log" 2>&1 &
+  log "  -> kicked push-training-to-hf pid=$!"
+}
+
+log "watchdog start — mem_threshold=${MEM_THRESHOLD_PCT}% stuck_ingest=${STUCK_INGEST_MIN}m stuck_push=${STUCK_PUSH_MIN}m tick=${TICK_SEC}s"
+
+while true; do
+  pct=$(mem_pct)
+  pairs_age=$(mtime_age_min "$HOME/.surrogate/training-pairs.jsonl")
+  kept_age=$(last_kept_age_min)
+  push_age=$(mtime_age_min "$HOME/.surrogate/logs/training-push.log")
+
+  # Memory healing has highest priority — OOM kills the whole container.
+  if [[ "$pct" -ge "$MEM_THRESHOLD_PCT" ]]; then
+    heal_memory "$pct"
+  # Then stuck ingestion (no new pairs AND no fresh kept: lines)
+  elif [[ "$pairs_age" -ge "$STUCK_INGEST_MIN" ]] && [[ "$kept_age" -ge 15 ]]; then
+    heal_stuck_ingest
+  # Then stale uploader (new pairs queued but uploader hasn't run)
+  elif [[ "$pairs_age" -lt 5 ]] && [[ "$push_age" -ge "$STUCK_PUSH_MIN" ]]; then
+    heal_stale_push
+  fi
+
+  # One-line heartbeat per tick — easy to grep for "things are fine"
+  log "tick mem=${pct}% pairs_age=${pairs_age}m kept_age=${kept_age}m push_age=${push_age}m"
+  sleep "$TICK_SEC"
+done
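The stuck-ingest and stale-push conditions both hinge on mtime_age_min, which
can be exercised locally with GNU `touch -d` (the /tmp file names below are
throwaway examples, not paths the watchdog uses):

```shell
#!/usr/bin/env bash
# Exercise the watchdog's mtime_age_min helper against controlled mtimes.
mtime_age_min() {
  [[ -f "$1" ]] || { echo 99999; return; }
  local mt now
  mt=$(stat -c %Y "$1" 2>/dev/null || echo 0)   # GNU stat: mtime as epoch secs
  now=$(date +%s)
  echo $(( (now - mt) / 60 ))
}

touch /tmp/fresh-pairs.jsonl                      # just written -> age 0m
touch -d '25 minutes ago' /tmp/stale-pairs.jsonl  # GNU coreutils date string
echo "fresh:   $(mtime_age_min /tmp/fresh-pairs.jsonl)m"
echo "stale:   $(mtime_age_min /tmp/stale-pairs.jsonl)m"
echo "missing: $(mtime_age_min /tmp/no-such-file)m"
# a ~25m stale age >= STUCK_INGEST_MIN (20) is what trips heal_stuck_ingest
```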
start.sh
@@ -255,16 +255,26 @@ echo "[$(date +%H:%M:%S)] hf-dataset-discoverer started (continuous mega-mix hun
 nohup bash ~/.surrogate/bin/auto-orchestrate-continuous.sh > "$LOG_DIR/auto-orchestrate-continuous.log" 2>&1 &
 echo "[$(date +%H:%M:%S)] auto-orchestrate-continuous started (4 parallel workers, never sleeps)" >> "$LOG_DIR/boot.log"
 
-# ──
+# ── 7e1. SELF-HEAL WATCHDOG — must start BEFORE memory-hungry workers ───────
+# Monitors RAM usage every 60s; preempts youngest dataset-enrich shard if
+# usage >= 85% to dodge the cpu-basic 16Gi OOM kill that would otherwise
+# crash the entire container. Also restarts stuck ingest / kicks stale uploader.
+nohup bash ~/.surrogate/bin/self-heal-watchdog.sh > "$LOG_DIR/self-heal-watchdog.log" 2>&1 &
+echo "[$(date +%H:%M:%S)] self-heal-watchdog started (mem<85%, ingest<20m, push<10m)" >> "$LOG_DIR/boot.log"
+
+# ── 7f. PARALLEL BULK INGEST (slug-hash sharded; 6 shards on cpu-basic) ─────
+# Was 16 shards but caused 'Memory limit exceeded (16Gi)' OOM. Each shard
+# peaks ~1 GB while streaming via 'datasets' lib. Watchdog above provides
+# a second safety net if peak still spikes.
 nohup bash ~/.surrogate/bin/bulk-ingest-parallel.sh > "$LOG_DIR/bulk-ingest-parallel.log" 2>&1 &
-echo "[$(date +%H:%M:%S)] bulk-ingest-parallel started (
+echo "[$(date +%H:%M:%S)] bulk-ingest-parallel started (6 shards, 293M total cap)" >> "$LOG_DIR/boot.log"
 
 # ── 7g. PARQUET-DIRECT INGEST (skip 'datasets' library overhead, 5-10× faster) ──
 # Downloads parquet shards directly via HF datasets-server API + pyarrow filter.
 # Targets only trillion-scale corpora where streaming is too slow.
-#
-nohup bash ~/.surrogate/bin/parquet-direct-ingest.sh > "$LOG_DIR/parquet-direct-ingest.log" 2>&1 &
-echo "[$(date +%H:%M:%S)] parquet-direct-ingest started (
+# DLs reduced to 2 parallel — combined with 6 ingest shards stays under 16Gi.
+PARQUET_PARALLEL=2 nohup bash ~/.surrogate/bin/parquet-direct-ingest.sh > "$LOG_DIR/parquet-direct-ingest.log" 2>&1 &
+echo "[$(date +%H:%M:%S)] parquet-direct-ingest started (2 parallel DLs)" >> "$LOG_DIR/boot.log"
 
 # ── 7c. Skill-synthesis daemon (extract patterns from cloned repos → skills) ─
 nohup bash ~/.surrogate/bin/skill-synthesis-daemon.sh > "$LOG_DIR/skill-synthesis.log" 2>&1 &