Ashira Pitchayapakayakul committed on
Commit 859b78d · 1 Parent(s): 99cff37

feat: self-heal-watchdog + drop shards 6→4 to dodge OOM kill


Monitor caught a second 'Memory limit exceeded (16Gi)' even with shards
already reduced to 6. Three changes to break the OOM-restart cycle:

1. NEW: bin/self-heal-watchdog.sh — runs on the Space, ticks every 60s,
   and acts on three failure modes that previously needed manual fixes:
   * mem >= 85% -> SIGTERM the youngest dataset-enrich shard. The shard
     loop's cooldown then respawns it (see the sketch after the
     bulk-ingest-parallel.sh hunk below), so memory recovers without
     letting the kernel OOM-kill the entire container.
   * /data/training-pairs.jsonl idle >= 20m AND no recent 'kept:' line
     -> tear down + respawn bulk-ingest-parallel.
   * push-training-to-hf log idle >= 10m while the file grew -> kick the
     cron once.
   Started in start.sh BEFORE the memory-hungry workers so it's ready
   to act as soon as pressure builds.

2. NUM_SHARDS 6 -> 4 and PARQUET_PARALLEL 6 -> 2. Empirically, the
   'datasets' library peaks at 1.5-2 GB per shard during parquet decode
   under load, so 6 simultaneous shards still breached the 16 GB ceiling.
   4 shards x ~2 GB peak + 2 parquet DLs x ~0.5 GB + 30 daemons + Python
   heap leaves ~3 GB headroom for the watchdog to react before OOM
   (rough budget sketched below the list).

3. Status server allowlist: 'self-heal-watchdog' added so the watchdog
   log is queryable via /logs/self-heal-watchdog (ticks + heal events).
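
The memory budget implied by (2), spelled out. The per-shard and per-DL
peaks are the figures quoted above; the daemons-plus-heap figure is only
what the ~3 GB headroom claim implies, not a measurement from this commit:

    4 ingest shards  x ~2.0 GB peak  ≈  8 GB
    2 parquet DLs    x ~0.5 GB peak  ≈  1 GB
    ~30 daemons + Python heap        ≈  4 GB  (implied, not measured)
                                     --------
                                     ≈ 13 GB of the 16 GB cap -> ~3 GB headroom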

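As a quick check of (3), the new log should be reachable over HTTP once
the Space is up. A hedged example: the host below is a placeholder for the
Space's public URL, and the 'lines' query parameter is an assumption based
on the log_tail(name, lines=100) signature in hermes-status-server.py:

    curl "https://<space-host>/logs/self-heal-watchdog"            # last 100 lines (handler default)
    curl "https://<space-host>/logs/self-heal-watchdog?lines=20"   # assuming 'lines' is a query param
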
bin/bulk-ingest-parallel.sh CHANGED
@@ -10,11 +10,15 @@ set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a
 LOG="$HOME/.surrogate/logs/bulk-ingest-parallel.log"
 mkdir -p "$(dirname "$LOG")"

-NUM_SHARDS="${INGEST_SHARDS:-6}"   # was 16 caused Memory limit exceeded (16Gi)
-                                   # on cpu-basic. Each shard streams
-                                   # ~1 GB peak via 'datasets' lib.
-                                   # 6 shards x ~1 GB + 30 daemons +
-                                   # Python heap fits comfortably under 16 GB.
+NUM_SHARDS="${INGEST_SHARDS:-4}"   # was 16 -> 6 -> 4. cpu-basic 16Gi
+                                   # cap was breached even with 6
+                                   # shards because 'datasets' lib
+                                   # peaks ~1.5-2 GB during parquet
+                                   # decode under load. 4 shards +
+                                   # parquet-direct (2 DLs) + 30
+                                   # daemons fits comfortably with
+                                   # ~3 GB headroom for the watchdog
+                                   # to react before OOM.
 SHARD_COOLDOWN="${SHARD_COOLDOWN:-120}"   # 2 min between shard cycles

 echo "[$(date +%H:%M:%S)] bulk-ingest-parallel start (shards=$NUM_SHARDS)" | tee -a "$LOG"
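
The commit message notes that a SIGTERM'd shard is respawned by the shard
loop's cooldown; that loop is not part of this hunk. A hypothetical sketch
of the pattern, assuming each shard runs under its own retry loop —
ingest-shard.sh and its flags are placeholder names, not taken from the
repository:

    # Hypothetical per-shard supervisor: when the watchdog SIGTERMs a shard,
    # the inner loop sees the exit, sleeps SHARD_COOLDOWN, then respawns it,
    # so memory drains before work resumes.
    for i in $(seq 0 $((NUM_SHARDS - 1))); do
      (
        while true; do
          bash "$HOME/.surrogate/bin/ingest-shard.sh" --shard "$i" --of "$NUM_SHARDS" \
            >> "$LOG" 2>&1 || true
          sleep "$SHARD_COOLDOWN"
        done
      ) &
    done
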
bin/hermes-status-server.py CHANGED
@@ -166,7 +166,7 @@ def log_tail(name: str, lines: int = 100) -> PlainTextResponse:
         "auto-orchestrate-loop", "training-push", "ollama", "discord-bot",
         "hermes-discord-bot", "surrogate-research-loop", "surrogate-research-apply",
         "surrogate-dev-loop", "domain-scrape-loop", "github-domain-scrape",
-        "qwen-coder", "git-clone", "git-pull", "redis", "parquet-direct-ingest", "bulk-ingest-parallel", "rag-vector-builder", "auto-orchestrate-continuous", "dataset-enrich", "hf-dataset-discoverer", "dedup-bootstrap", "github-agentic-crawler", "ollama-pull-granite", "synthetic-data", "self-ingest", "scrape-sre-postmortems", "refresh-cve-feed",
+        "qwen-coder", "git-clone", "git-pull", "redis", "parquet-direct-ingest", "bulk-ingest-parallel", "rag-vector-builder", "auto-orchestrate-continuous", "dataset-enrich", "hf-dataset-discoverer", "dedup-bootstrap", "github-agentic-crawler", "ollama-pull-granite", "synthetic-data", "self-ingest", "scrape-sre-postmortems", "refresh-cve-feed", "self-heal-watchdog",
         "ollama-pull-coder", "ollama-pull-devstral", "ollama-pull-fallback",
         "ollama-pull-yicoder", "ollama-pull-embed", "ollama-pull-light",
     }
bin/self-heal-watchdog.sh ADDED
@@ -0,0 +1,118 @@
+#!/usr/bin/env bash
+# Self-healing watchdog — runs on HF Space, not Mac.
+#
+# Monitors three failure modes that previously needed manual intervention:
+#   1. Memory pressure (>85% of 16Gi cap) — preempt OOM by killing the
+#      youngest dataset-enrich shard. Earlier shards are preferred to live
+#      because they're closer to completing their iteration.
+#   2. Stuck ingestion — if /data/training-pairs.jsonl hasn't grown in 20 min
+#      AND bulk-ingest log hasn't logged "kept:" in 15 min, force-restart
+#      bulk-ingest-parallel.
+#   3. Stale upload backlog — if push-training-to-hf.sh hasn't logged anything
+#      in 10 min while file has grown, kick the cron once.
+#
+# Conservative: never kills more than one process per cycle. Always logs the
+# decision and the trigger so post-mortem is easy.
+
+set -uo pipefail
+LOG="$HOME/.surrogate/logs/self-heal-watchdog.log"
+mkdir -p "$(dirname "$LOG")"
+
+MEM_THRESHOLD_PCT="${MEM_THRESHOLD_PCT:-85}"
+STUCK_INGEST_MIN="${STUCK_INGEST_MIN:-20}"
+STUCK_PUSH_MIN="${STUCK_PUSH_MIN:-10}"
+TICK_SEC="${TICK_SEC:-60}"
+
+log() { echo "[$(date +%Y-%m-%d\ %H:%M:%S)] $*" | tee -a "$LOG"; }
+
+mem_pct() {
+  # Container's RAM as %. /proc/meminfo on Linux only.
+  awk '/^MemTotal:/ {tot=$2} /^MemAvailable:/ {avail=$2}
+       END {if (tot>0) printf "%d", (tot-avail)*100/tot; else print 0}' /proc/meminfo 2>/dev/null
+}
+
+mtime_age_min() {
+  [[ -f "$1" ]] || { echo 99999; return; }
+  local mt now
+  mt=$(stat -c %Y "$1" 2>/dev/null || echo 0)
+  now=$(date +%s)
+  echo $(( (now - mt) / 60 ))
+}
+
+last_kept_age_min() {
+  local log_path="$HOME/.surrogate/logs/bulk-ingest-parallel.log"
+  [[ -f "$log_path" ]] || { echo 99999; return; }
+  # Find the most recent "kept:" line and compute its age via the log mtime
+  # as a coarse proxy. Precise per-line timestamping is a future refinement.
+  if grep -q "kept:" "$log_path" 2>/dev/null; then
+    mtime_age_min "$log_path"
+  else
+    echo 99999
+  fi
+}
+
+heal_memory() {
+  local pct="$1"
+  log "MEMORY ALERT pct=$pct% threshold=$MEM_THRESHOLD_PCT% — preempting OOM"
+  # Find the youngest (highest PID) dataset-enrich shard process and SIGTERM.
+  # The shard loop sleeps SHARD_COOLDOWN before respawning, so memory recovers.
+  local victim
+  victim=$(pgrep -f "dataset-enrich.sh" | sort -nr | head -1)
+  if [[ -n "$victim" ]]; then
+    log "  -> kill youngest dataset-enrich pid=$victim"
+    kill -TERM "$victim" 2>/dev/null || true
+  else
+    log "  -> no dataset-enrich processes found; nothing to preempt"
+  fi
+}
+
+heal_stuck_ingest() {
+  log "STUCK INGEST — pairs file age >${STUCK_INGEST_MIN}m AND no recent kept: lines"
+  # Force-restart the parallel-ingest manager. Existing shard children get
+  # reaped; the next supervisor cycle re-spawns them with fresh state.
+  local victims
+  victims=$(pgrep -f "bulk-ingest-parallel.sh")
+  if [[ -n "$victims" ]]; then
+    log "  -> SIGTERM bulk-ingest-parallel pids: $victims"
+    echo "$victims" | xargs -r kill -TERM 2>/dev/null || true
+    sleep 5
+    nohup bash "$HOME/.surrogate/bin/bulk-ingest-parallel.sh" \
+      >> "$HOME/.surrogate/logs/bulk-ingest-parallel.log" 2>&1 &
+    log "  -> respawned bulk-ingest-parallel pid=$!"
+  else
+    log "  -> bulk-ingest-parallel not running; spawning fresh"
+    nohup bash "$HOME/.surrogate/bin/bulk-ingest-parallel.sh" \
+      >> "$HOME/.surrogate/logs/bulk-ingest-parallel.log" 2>&1 &
+  fi
+}
+
+heal_stale_push() {
+  log "STALE PUSH — uploader log idle >${STUCK_PUSH_MIN}m while file grew"
+  nohup bash "$HOME/.surrogate/bin/push-training-to-hf.sh" \
+    >> "$HOME/.surrogate/logs/training-push.log" 2>&1 &
+  log "  -> kicked push-training-to-hf pid=$!"
+}
+
+log "watchdog start — mem_threshold=${MEM_THRESHOLD_PCT}% stuck_ingest=${STUCK_INGEST_MIN}m stuck_push=${STUCK_PUSH_MIN}m tick=${TICK_SEC}s"
+
+while true; do
+  pct=$(mem_pct)
+  pairs_age=$(mtime_age_min "$HOME/.surrogate/training-pairs.jsonl")
+  kept_age=$(last_kept_age_min)
+  push_age=$(mtime_age_min "$HOME/.surrogate/logs/training-push.log")
+
+  # Memory healing has highest priority — OOM kills the whole container.
+  if [[ "$pct" -ge "$MEM_THRESHOLD_PCT" ]]; then
+    heal_memory "$pct"
+  # Then stuck ingestion (no new pairs AND no fresh kept: lines)
+  elif [[ "$pairs_age" -ge "$STUCK_INGEST_MIN" ]] && [[ "$kept_age" -ge 15 ]]; then
+    heal_stuck_ingest
+  # Then stale uploader (new pairs queued but uploader hasn't run)
+  elif [[ "$pairs_age" -lt 5 ]] && [[ "$push_age" -ge "$STUCK_PUSH_MIN" ]]; then
+    heal_stale_push
+  fi
+
+  # One-line heartbeat per tick — easy to grep for "things are fine"
+  log "tick mem=${pct}% pairs_age=${pairs_age}m kept_age=${kept_age}m push_age=${push_age}m"
+  sleep "$TICK_SEC"
+done
start.sh CHANGED
@@ -255,16 +255,26 @@ echo "[$(date +%H:%M:%S)] hf-dataset-discoverer started (continuous mega-mix hun
 nohup bash ~/.surrogate/bin/auto-orchestrate-continuous.sh > "$LOG_DIR/auto-orchestrate-continuous.log" 2>&1 &
 echo "[$(date +%H:%M:%S)] auto-orchestrate-continuous started (4 parallel workers, never sleeps)" >> "$LOG_DIR/boot.log"

-# ── 7f. PARALLEL BULK INGEST (16 shards by slug-hash, drain 293M cap) ───────
+# ── 7e1. SELF-HEAL WATCHDOG must start BEFORE memory-hungry workers ───────
+# Monitors RAM usage every 60s; preempts youngest dataset-enrich shard if
+# usage >= 85% to dodge the cpu-basic 16Gi OOM kill that would otherwise
+# crash the entire container. Also restarts stuck ingest / kicks stale uploader.
+nohup bash ~/.surrogate/bin/self-heal-watchdog.sh > "$LOG_DIR/self-heal-watchdog.log" 2>&1 &
+echo "[$(date +%H:%M:%S)] self-heal-watchdog started (mem<85%, ingest<20m, push<10m)" >> "$LOG_DIR/boot.log"
+
+# ── 7f. PARALLEL BULK INGEST (slug-hash sharded; 4 shards on cpu-basic) ─────
+# Was 16 shards but caused 'Memory limit exceeded (16Gi)' OOM. Each shard
+# peaks ~1.5-2 GB while streaming via 'datasets' lib. Watchdog above provides
+# a second safety net if peak still spikes.
 nohup bash ~/.surrogate/bin/bulk-ingest-parallel.sh > "$LOG_DIR/bulk-ingest-parallel.log" 2>&1 &
-echo "[$(date +%H:%M:%S)] bulk-ingest-parallel started (16 shards, 293M total cap)" >> "$LOG_DIR/boot.log"
+echo "[$(date +%H:%M:%S)] bulk-ingest-parallel started (4 shards, 293M total cap)" >> "$LOG_DIR/boot.log"

 # ── 7g. PARQUET-DIRECT INGEST (skip 'datasets' library overhead, 5-10× faster) ──
 # Downloads parquet shards directly via HF datasets-server API + pyarrow filter.
 # Targets only trillion-scale corpora where streaming is too slow.
-# 6 parallel downloadscoordinated with bulk-ingest via central dedup store.
-nohup bash ~/.surrogate/bin/parquet-direct-ingest.sh > "$LOG_DIR/parquet-direct-ingest.log" 2>&1 &
-echo "[$(date +%H:%M:%S)] parquet-direct-ingest started (6 parallel DLs)" >> "$LOG_DIR/boot.log"
+# DLs reduced to 2 parallel; combined with 4 ingest shards this stays under 16Gi.
+PARQUET_PARALLEL=2 nohup bash ~/.surrogate/bin/parquet-direct-ingest.sh > "$LOG_DIR/parquet-direct-ingest.log" 2>&1 &
+echo "[$(date +%H:%M:%S)] parquet-direct-ingest started (2 parallel DLs)" >> "$LOG_DIR/boot.log"

 # ── 7c. Skill-synthesis daemon (extract patterns from cloned repos → skills) ─
 nohup bash ~/.surrogate/bin/skill-synthesis-daemon.sh > "$LOG_DIR/skill-synthesis.log" 2>&1 &