Ashira Pitchayapakayakul committed on
Commit
326e0f0
·
1 Parent(s): 8b4b1b5

feat(overnight): scoring rubric + report generator + multi-window orchestrator

Browse files

Adds the wake-up evaluation pipeline so the user gets a single scored markdown
report plus a Discord ping in the morning, showing exactly how far V8 got with
autonomous monitoring, incident response, and autonomous coding.

overnight-rubric.md (~/.surrogate/state/specs/) — 4-axis weighted scoring:
• Coding Capability (40%) — bench wins 7+/9, BFCL ≥70, Multi-IaC + ITBench
  ≥55, ≥10 draft PRs on arkship Phase B
• Incident Response (25%) — diagnosis ≥80% w/ conf 0.7+, verifier ≥40%,
  apply ≥80% no rollback, 0 destructive queued
• Monitoring (15%) — sweep cadence ≥95%, probe coverage, watchdog clean
• Agentic Breadth (20%) — trigger types, fix-kind diversity, self-improve
  flywheel produced data, idempotency clean
• Verdict cutoffs: 80+ "truly agentic" / 60-79 "partial" / 40-59 "flaky" / <40 "not yet"
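The weighting above can be sketched in a few lines: each axis has four 10-point sub-checks (40 raw points) that get scaled onto the axis's share of the 100-point total. This is a minimal illustration of the arithmetic, not the shipped scorer; the helper names are hypothetical.

```python
# Rubric weights per axis (sum to 100). Each axis's raw sub-total is 0-40.
WEIGHTS = {"coding": 40, "incident": 25, "monitoring": 15, "breadth": 20}

def axis_score(raw_sub_total: float, axis_weight: float, raw_max: float = 40) -> float:
    """Scale a 0-40 raw sub-total onto the axis's share of the 100-pt total."""
    return raw_sub_total * axis_weight / raw_max

def total_score(raw: dict) -> float:
    """Weighted sum across the four axes, rounded to one decimal."""
    return round(sum(axis_score(raw[a], w) for a, w in WEIGHTS.items()), 1)
```

A perfect 40/40 on every axis yields exactly 100; halving every raw sub-total halves the total, so no axis can dominate beyond its stated weight.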

overnight-report.sh — gathers all evidence (bench summary, outcomes.jsonl,
queue/, release cycles, self-improve, watchdog state) and renders a dense
markdown report at ~/.surrogate/state/overnight-report-YYYY-MM-DD.md +
sends a Discord TLDR. Idempotent — can be re-run anytime; overwrites the file.
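The idempotency comes from a date-stamped output path, and the Discord ping is just a JSON `content` payload. A minimal sketch of both pieces (it builds the payload but does not actually post; the final `curl` line is shown commented out):

```shell
# Date-stamped report path: re-running on the same day overwrites the same file.
TODAY="$(date +%Y-%m-%d)"
REPORT="$HOME/.surrogate/state/overnight-report-$TODAY.md"

# Discord webhook payload, built via python3 so quoting/escaping is safe.
PAYLOAD="$(python3 -c "import json,sys; print(json.dumps({'content': sys.argv[1]}))" \
  "Overnight Report ready — see $REPORT")"
echo "$PAYLOAD"

# To actually send (needs DISCORD_WEBHOOK set):
#   curl -s -X POST -H "Content-Type: application/json" -d "$PAYLOAD" "$DISCORD_WEBHOOK"
```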

overnight-pipeline.sh — one-shot orchestrator that brings up the pipeline:
1. Sanity-checks that auto-swap-and-bench is alive + tokens are present
2. Starts watchdog FIRST (kill-switch armed before any risk)
3. Starts autonomous-sre.sh (baseline against v1, auto-cuts to V8 post-swap)
4. Spawns release-gate.sh — polls the bench summary; only fires
   autonomous-release.sh if V8 wins ≥3/9 metrics (skips Branch C)
5. Hourly self-improve.sh tick to grow SFT/KTO/skills datasets
6. 3 wake-up windows for the report: 07:00 / 11:00 / 17:00 Bangkok
7. release-gate also calls overnight-report on the bench-completion event
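The fixed wake-up windows boil down to a "seconds until the next wall-clock hour in a given timezone" computation. The same approach as the script's inline python3 snippet, wrapped as a hypothetical helper for clarity:

```python
import datetime
import zoneinfo

def seconds_until(hour, tz="Asia/Bangkok", now=None):
    """Seconds from `now` until the next wall-clock `hour`:00 in `tz`.

    If that time has already passed today, schedule it for tomorrow.
    """
    if now is None:
        now = datetime.datetime.now(zoneinfo.ZoneInfo(tz))
    target = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if target <= now:
        target += datetime.timedelta(days=1)
    return int((target - now).total_seconds())
```

Feeding this into `sleep` keeps the scheduling portable (no `at`/cron dependency) while still firing at local Bangkok wall-clock times.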

The report scoring and the verdict are deterministic and transparent, and the
user can rerun them manually anytime: `bash bin/v2/overnight-report.sh`.
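The verdict itself is just a threshold-band lookup on the total, so the same score always yields the same badge. A sketch of that mapping (the function name is hypothetical; the bands and badge strings follow the rubric cutoffs above):

```python
def verdict_badge(total: float) -> str:
    """Map a 0-100 rubric total onto the four verdict bands."""
    if total >= 80:
        return "AGENTIC"      # truly agentic — works as advertised
    if total >= 60:
        return "PARTIAL"      # works in scope, needs targeted fixes
    if total >= 40:
        return "FLAKY"        # promise but flaky
    return "NOT_READY"        # Branch C cleanup first
```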

bin/v2/overnight-pipeline.sh ADDED
@@ -0,0 +1,239 @@
+ #!/usr/bin/env bash
+ # Surrogate-1 — overnight pipeline orchestrator.
+ #
+ # One-shot bootstrap that brings up the eval pipeline so the user wakes up
+ # tomorrow morning to a scored report. Wires the timing carefully because
+ # V8 isn't deployed yet — we don't want autonomous-release.sh hammering v1.
+ #
+ # Timeline (T+0 = now):
+ #   T+0       watchdog + autonomous-sre come online (runs against current
+ #             v1 — that's the BASELINE measurement)
+ #   T+~8h     Kaggle V#7 finishes → adapter pushed to Hub →
+ #             auto-swap-and-bench daemon (already running, pid 94063+)
+ #             swaps surrogate1 ZeroGPU to V8 LoRA
+ #             FROM HERE ON, autonomous-sre is talking to V8
+ #   T+~16h    bench-v1-vs-v15.sh completes → post-bench-decide fires
+ #             Branch A/B/C
+ #   T+~16-18h self-improve.sh runs (cron-style, every 6 hr)
+ #             If Branch A or B → autonomous-release.sh fires for arkship
+ #   T+~22h    overnight-report.sh runs, scores, Discord ping
+ #
+ # What this script does NOW:
+ #   1. Sanity checks (V8 daemon target correct, env tokens)
+ #   2. Starts watchdog (FIRST, so kill-switch is armed before any real work)
+ #   3. Starts autonomous-sre.sh
+ #   4. Schedules a delayed autonomous-release.sh launcher (fires only if
+ #      bench summary.json shows V8 wins ≥3/9 metrics — Branch A or B gate)
+ #   5. Schedules self-improve.sh hourly tick
+ #   6. Schedules overnight-report.sh at 07:00 / 11:00 / 17:00 Bangkok local
+ #
+ # All scheduling uses background sleep loops (no dependency on `at`/cron).
+ set -uo pipefail
+ [[ -f "$HOME/.hermes/.env" ]] && { set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a; }
+
+ HFB="$HOME/.surrogate/hf-space/bin/v2"
+ STATE="$HOME/.surrogate/state"
+ LOGS="$HOME/.surrogate/logs"
+ mkdir -p "$STATE" "$LOGS"
+
+ log() { echo "[$(date '+%Y-%m-%dT%H:%M:%S')] $*" | tee -a "$LOGS/overnight-pipeline.log"; }
+ notify() {
+   [[ -z "${DISCORD_WEBHOOK:-}" ]] && return
+   curl -s -X POST -H "Content-Type: application/json" \
+     -d "$(python3 -c "import json,sys; print(json.dumps({'content': sys.argv[1]}))" "$1")" \
+     "$DISCORD_WEBHOOK" >/dev/null 2>&1 || true
+ }
+
+ log "═══ overnight-pipeline starting ═══"
+
+ # ── 1. Sanity checks ────────────────────────────────────────────────────────
+ log "── sanity checks ──"
+ if ! pgrep -f auto-swap-and-bench.sh >/dev/null; then
+   log "  ✗ auto-swap-and-bench daemon NOT running — aborting"
+   notify "❌ overnight-pipeline aborted: auto-swap-and-bench not running"
+   exit 1
+ fi
+ log "  ✓ auto-swap-and-bench alive (pid $(pgrep -f auto-swap-and-bench.sh | head -1))"
+
+ if [[ -z "${HF_TOKEN:-}" ]] || [[ -z "${DISCORD_WEBHOOK:-}" ]]; then
+   log "  ⚠ HF_TOKEN or DISCORD_WEBHOOK missing — pipeline will work but no notifications"
+ fi
+
+ # ── 2. Watchdog (FIRST, kill-switch armed before risk) ─────────────────────
+ log "── starting watchdog ──"
+ rm -f "$STATE/watchdog-killed"   # re-arm if previously tripped
+ if pgrep -f "watchdog.sh" >/dev/null; then
+   log "  watchdog already running (pid $(pgrep -f watchdog.sh | head -1)) — leaving as-is"
+ else
+   nohup bash "$HFB/watchdog.sh" > "$LOGS/watchdog.log" 2>&1 &
+   disown
+   sleep 2
+   if pgrep -f "watchdog.sh" >/dev/null; then
+     log "  ✓ watchdog started pid $(pgrep -f watchdog.sh | head -1)"
+   else
+     log "  ✗ watchdog failed to start — see $LOGS/watchdog.log"
+     exit 1
+   fi
+ fi
+
+ # ── 3. Autonomous SRE ──────────────────────────────────────────────────────
+ log "── starting autonomous-sre ──"
+ if pgrep -f "autonomous-sre.sh" >/dev/null; then
+   log "  already running (pid $(pgrep -f autonomous-sre.sh | head -1)) — leaving as-is"
+ else
+   nohup bash "$HFB/autonomous-sre.sh" > "$LOGS/autonomous-sre.log" 2>&1 &
+   disown
+   sleep 2
+   if pgrep -f "autonomous-sre.sh" >/dev/null; then
+     log "  ✓ autonomous-sre started pid $(pgrep -f autonomous-sre.sh | head -1)"
+   else
+     log "  ✗ autonomous-sre failed to start"
+   fi
+ fi
+
+ # ── 4. Delayed autonomous-release launcher (fires only after bench passes) ─
+ log "── scheduling delayed autonomous-release launcher ──"
+ cat > "$STATE/release-gate.sh" <<'GATEEOF'
+ #!/usr/bin/env bash
+ # Polls bench summary.json. Fires autonomous-release.sh only if V8 wins
+ # ≥3/9 metrics over v1 (Branch A or B threshold). Otherwise stays dormant
+ # and notifies — autonomous coding work is gated on actual capability.
+ [[ -f "$HOME/.hermes/.env" ]] && { set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a; }
+ HFB="$HOME/.surrogate/hf-space/bin/v2"
+ STATE="$HOME/.surrogate/state"
+ LOGS="$HOME/.surrogate/logs"
+
+ log() { echo "[$(date '+%Y-%m-%dT%H:%M:%S')] $*" >> "$LOGS/release-gate.log"; }
+ notify() {
+   [[ -z "${DISCORD_WEBHOOK:-}" ]] && return
+   curl -s -X POST -H "Content-Type: application/json" \
+     -d "$(python3 -c "import json,sys; print(json.dumps({'content': sys.argv[1]}))" "$1")" \
+     "$DISCORD_WEBHOOK" >/dev/null 2>&1 || true
+ }
+
+ # Poll up to MAX_HOURS for bench to complete
+ MAX_HOURS="${MAX_HOURS:-20}"
+ DEADLINE=$(( $(date +%s) + MAX_HOURS * 3600 ))
+ fired=""
+ while [[ $(date +%s) -lt $DEADLINE ]]; do
+   LATEST=$(ls -t "$HOME/.surrogate/eval/bench-v1-vs-v15-"*"/summary.json" 2>/dev/null | head -1)
+   if [[ -n "$LATEST" ]]; then
+     # Count V8 wins
+     wins=$(python3 -c "
+ import json, sys
+ try: d = json.load(open('$LATEST'))
+ except: print(0); sys.exit(0)
+ v8 = d.get('v1.2-research', {})
+ v1 = d.get('v1', {})
+ n = 0
+ for k in ('humaneval_plus','mbpp_plus','lcb_v6','bfcl_v3_overall',
+           'ruler_16k_avg','swebench_verified_lite100','axentx_eval_50',
+           'multi_iac_eval','itbench_lite'):
+     try:
+         if float(v8.get(k,'0') or 0) - float(v1.get(k,'0') or 0) > 0.5: n += 1
+     except: continue
+ print(n)
+ " 2>/dev/null || echo 0)
+     # Only proceed when bench actually populated v1.2-research column
+     has_v8=$(python3 -c "
+ import json
+ d = json.load(open('$LATEST'))
+ v8 = d.get('v1.2-research', {})
+ print(1 if any(v8.values()) else 0)
+ " 2>/dev/null)
+     if [[ "$has_v8" == "1" ]]; then
+       log "bench summary populated for v1.2-research, wins=$wins/9"
+       if (( wins >= 3 )); then
+         log "→ FIRE autonomous-release.sh (Branch A/B threshold met)"
+         notify "🚀 release-gate: V8 wins $wins/9 → starting autonomous-release on arkship"
+         nohup bash "$HFB/autonomous-release.sh" > "$LOGS/autonomous-release.log" 2>&1 &
+         disown
+       else
+         log "→ HOLD (only $wins/9 wins, threshold ≥3 — Branch C territory)"
+         notify "⛔ release-gate: V8 only won $wins/9 → autonomous-release HELD (Branch C)"
+       fi
+       fired=1
+       break   # fall through so appended event hooks still run
+     fi
+   fi
+   sleep 600   # poll every 10 min
+ done
+ if [[ -z "$fired" ]]; then
+   log "deadline reached without bench completion — exiting without firing"
+   notify "⏰ release-gate: ${MAX_HOURS}h deadline hit, bench never completed"
+ fi
+ GATEEOF
+ chmod +x "$STATE/release-gate.sh"
+ nohup bash "$STATE/release-gate.sh" >> "$LOGS/release-gate.log" 2>&1 &
+ GATE_PID=$!
+ disown
+ log "  ✓ release-gate spawned pid=$GATE_PID (polls bench, gated fire)"
+
+ # ── 5. Self-improve hourly tick ────────────────────────────────────────────
+ log "── scheduling self-improve hourly tick ──"
+ cat > "$STATE/self-improve-tick.sh" <<'TICKEOF'
+ #!/usr/bin/env bash
+ [[ -f "$HOME/.hermes/.env" ]] && { set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a; }
+ while true; do
+   bash "$HOME/.surrogate/hf-space/bin/v2/self-improve.sh" all \
+     >> "$HOME/.surrogate/logs/self-improve.log" 2>&1
+   sleep 3600
+ done
+ TICKEOF
+ chmod +x "$STATE/self-improve-tick.sh"
+ nohup bash "$STATE/self-improve-tick.sh" >> "$LOGS/self-improve.log" 2>&1 &
+ TICK_PID=$!
+ disown
+ log "  ✓ self-improve hourly tick pid=$TICK_PID"
+
+ # ── 6. Schedule overnight-report — multiple wake-up windows ────────────────
+ # V8 won't be fully benched until ~T+12-16h, so fire the report multiple
+ # times so user gets meaningful TLDR no matter when they wake up. Each fire
+ # is idempotent — overwrites today's report file with latest score.
+ log "── scheduling overnight-report (multi-fire windows) ──"
+ schedule_report() {
+   local label="$1" hour="$2"
+   local sleep_sec
+   sleep_sec=$(python3 -c "
+ import datetime, zoneinfo
+ now = datetime.datetime.now(zoneinfo.ZoneInfo('Asia/Bangkok'))
+ target = now.replace(hour=$hour, minute=0, second=0, microsecond=0)
+ if target <= now:
+     target += datetime.timedelta(days=1)
+ print(int((target - now).total_seconds()))
+ ")
+   nohup bash -c "sleep $sleep_sec; bash '$HFB/overnight-report.sh' >> '$LOGS/overnight-report.log' 2>&1" \
+     >> "$LOGS/overnight-report.log" 2>&1 &
+   local pid=$!
+   disown
+   log "  ✓ $label fire @ ${hour}:00 Bangkok pid=$pid (in $(($sleep_sec/3600))h $(($sleep_sec%3600/60))m)"
+ }
+ # 3 wake-up windows: early-morning baseline, post-bench likely, evening
+ schedule_report "early"   7
+ schedule_report "mid"     11
+ schedule_report "evening" 17
+
+ # ── 7. Wire release-gate to fire overnight-report on bench completion ──────
+ log "── adding event-triggered report fire to release-gate ──"
+ # patch release-gate to call overnight-report after its decision
+ cat >> "$STATE/release-gate.sh" <<EVTEOF
+
+ # Auto-fire report after gate decides (event-triggered, complements time-based)
+ bash "$HFB/overnight-report.sh" >> "$LOGS/overnight-report.log" 2>&1 || true
+ EVTEOF
+ log "  ✓ report fires automatically when release-gate completes"
+
+ # ── 8. Pre-flight summary ──────────────────────────────────────────────────
+ log ""
+ log "═══ overnight pipeline ARMED ═══"
+ log "  watchdog            : pid $(pgrep -f watchdog.sh | head -1)"
+ log "  autonomous-sre      : pid $(pgrep -f autonomous-sre.sh | head -1)"
+ log "  auto-swap-and-bench : pid $(pgrep -f auto-swap-and-bench.sh | head -1)"
+ log "  release-gate        : pid $GATE_PID (polls bench, gated by ≥3/9 wins)"
+ log "  self-improve tick   : pid $TICK_PID (every hour)"
+ log "  overnight-report    : fires at 07:00 / 11:00 / 17:00 Bangkok"
+ log ""
+ log "  rubric: $HOME/.surrogate/state/specs/overnight-rubric.md"
+ log "  report: $HOME/.surrogate/state/overnight-report-$(date +%Y-%m-%d).md (after fire)"
+
+ notify "🌙 overnight pipeline armed — first wakeup report at 07:00 Bangkok (rubric: 4-axis × 100pts, Discord ping when done)"
+ log "═══ done ═══"
bin/v2/overnight-report.sh ADDED
@@ -0,0 +1,471 @@
+ #!/usr/bin/env bash
+ # Surrogate-1 — overnight evaluation report generator.
+ #
+ # Reads all evidence (bench summary, outcomes log, queue, autonomous-release
+ # specs, daemon logs) and computes the 4-axis weighted score from
+ # ~/.surrogate/state/specs/overnight-rubric.md.
+ #
+ # Output: a single dense markdown report at
+ #   ~/.surrogate/state/overnight-report-YYYY-MM-DD.md
+ # + Discord ping with TLDR + verdict.
+ #
+ # Run on schedule (cron at 06:00 Bangkok = 23:00 UTC) or manually:
+ #   bash bin/v2/overnight-report.sh
+ set -uo pipefail
+ [[ -f "$HOME/.hermes/.env" ]] && { set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a; }
+
+ STATE="$HOME/.surrogate/state"
+ LOGS="$HOME/.surrogate/logs"
+ EVAL="$HOME/.surrogate/eval"
+ TODAY="$(date +%Y-%m-%d)"
+ REPORT="$STATE/overnight-report-$TODAY.md"
+ mkdir -p "$STATE"
+
+ notify() {
+   [[ -z "${DISCORD_WEBHOOK:-}" ]] && return
+   local content="$1"
+   curl -s -X POST -H "Content-Type: application/json" \
+     -d "$(python3 -c "import json,sys; print(json.dumps({'content': sys.argv[1]}))" "$content")" \
+     "$DISCORD_WEBHOOK" >/dev/null 2>&1 || true
+ }
+
+ # ── Compute everything in one Python pass for consistency ────────────────────
+ python3 - "$REPORT" <<'PYEOF'
+ import json, os, re, sys, glob, datetime as dt
+ from pathlib import Path
+ from collections import Counter
+
+ REPORT = sys.argv[1]
+ HOME = Path.home()
+ STATE = HOME / ".surrogate/state"
+ EVAL = HOME / ".surrogate/eval"
+ LOGS = HOME / ".surrogate/logs"
+
+ now = dt.datetime.now(dt.timezone.utc)
+
+ def read_lines(p):
+     try: return Path(p).read_text().splitlines()
+     except: return []
+
+ def jsonl(p):
+     rows = []
+     for L in read_lines(p):
+         try: rows.append(json.loads(L))
+         except: pass
+     return rows
+
+ # ── Evidence: bench summary (4-way × 9 evals) ────────────────────────────────
+ def latest_bench():
+     candidates = sorted(EVAL.glob("bench-v1-vs-v15-*/summary.json"),
+                         key=lambda p: p.stat().st_mtime, reverse=True)
+     return candidates[0] if candidates else None
+
+ bench_path = latest_bench()
+ bench = json.loads(bench_path.read_text()) if bench_path and bench_path.exists() else {}
+
+ def score_metric(key):
+     """Compare V8 (v1.2-research) vs v1 baseline on one metric."""
+     try:
+         v8 = float(bench.get("v1.2-research", {}).get(key, "0") or 0)
+         v1 = float(bench.get("v1", {}).get(key, "0") or 0)
+         delta = v8 - v1
+         win = delta > 0.5
+         return v8, v1, delta, win
+     except Exception:
+         return None, None, None, False
+
+ # ── Evidence: outcomes.jsonl (SRE + release daemon traces) ───────────────────
+ outcomes = jsonl(STATE / "outcomes.jsonl")
+ sre_rows = [r for r in outcomes if r.get("daemon") == "sre"]
+ release_rows = [r for r in outcomes if r.get("daemon") == "release"]
+
+ # ── Evidence: queue + draft PRs ──────────────────────────────────────────────
+ queue_dirs = list((STATE / "queue").glob("*")) if (STATE / "queue").exists() else []
+ draft_pr_count = 0
+ release_cycles = list(STATE.glob("release-*"))
+ for rc in release_cycles:
+     log = rc / "winner.json"
+     if log.exists():
+         try:
+             j = json.loads(log.read_text())
+             if j.get("target_file"):
+                 draft_pr_count += 1
+         except: pass
+
+ # ── Self-improve flywheel state ─────────────────────────────────────────────
+ si_dir = STATE / "self-improve"
+ sft_n = len(read_lines(si_dir / "sft.jsonl")) if si_dir.exists() else 0
+ kto_n = len(read_lines(si_dir / "kto.jsonl")) if si_dir.exists() else 0
+ skills_n = len(read_lines(si_dir / "skills.jsonl")) if si_dir.exists() else 0
+
+ # ── Watchdog kills ───────────────────────────────────────────────────────────
+ killed_marker = STATE / "watchdog-killed"
+ watchdog_killed = killed_marker.exists()
+
+ # ── Idempotency state (replay storms) ────────────────────────────────────────
+ idem_rows = jsonl(STATE / "idempotency.jsonl")
+ idem_keys = Counter(r.get("key") for r in idem_rows)
+ replay_storms = sum(1 for _, c in idem_keys.items() if c >= 2)
+
+ # ── Axis 1: Coding Capability (40%) ─────────────────────────────────────────
+ def score_coding():
+     parts = {}
+     if bench:
+         wins = 0
+         for key in ("humaneval_plus", "mbpp_plus", "lcb_v6", "bfcl_v3_overall",
+                     "ruler_16k_avg", "swebench_verified_lite100",
+                     "axentx_eval_50", "multi_iac_eval", "itbench_lite"):
+             _, _, _, w = score_metric(key)
+             if w: wins += 1
+         parts["wins_pts"] = (10 if wins >= 7 else 7 if wins >= 5
+                              else 4 if wins >= 3 else 1)
+         parts["wins_count"] = wins
+
+         bfcl = float(bench.get("v1.2-research", {}).get("bfcl_v3_overall", "0") or 0)
+         parts["bfcl_score"] = bfcl
+         parts["bfcl_pts"] = (10 if bfcl >= 70 else 7 if bfcl >= 60
+                              else 4 if bfcl >= 50 else 1)
+
+         miac = float(bench.get("v1.2-research", {}).get("multi_iac_eval", "0") or 0)
+         itb = float(bench.get("v1.2-research", {}).get("itbench_lite", "0") or 0)
+         if miac >= 55 and itb >= 55: parts["domain_pts"] = 10
+         elif miac >= 55 or itb >= 55: parts["domain_pts"] = 6
+         elif miac >= 40 or itb >= 40: parts["domain_pts"] = 3
+         else: parts["domain_pts"] = 1
+         parts["multi_iac"] = miac
+         parts["itbench"] = itb
+     else:
+         parts["wins_pts"] = parts["bfcl_pts"] = parts["domain_pts"] = 0
+         parts["wins_count"] = parts["bfcl_score"] = parts["multi_iac"] = parts["itbench"] = 0
+
+     if draft_pr_count >= 10: parts["pr_pts"] = 10
+     elif draft_pr_count >= 5: parts["pr_pts"] = 7
+     elif draft_pr_count >= 1: parts["pr_pts"] = 4
+     else: parts["pr_pts"] = 0
+     parts["pr_count"] = draft_pr_count
+
+     parts["sub_total"] = parts["wins_pts"] + parts["bfcl_pts"] + parts["domain_pts"] + parts["pr_pts"]
+     parts["axis_score"] = parts["sub_total"] * 4 / 4   # ×4/4 = weight 1; kept for symmetry with other axes
+     parts["axis_max"] = 40
+     return parts
+
+ # ── Axis 2: Incident Response (25%) ─────────────────────────────────────────
+ def score_incident():
+     parts = {}
+     triggered = [r for r in sre_rows if r.get("response")]
+     parts["triggered_n"] = len(triggered)
+     if triggered:
+         with_fix = [r for r in triggered
+                     if isinstance(r.get("response"), dict)
+                     and r["response"].get("fix_kind", "none") != "none"
+                     and float(r["response"].get("confidence", 0)) >= 0.7]
+         diag_pct = 100 * len(with_fix) / max(1, len(triggered))
+     else:
+         with_fix, diag_pct = [], 0
+     parts["diag_pct"] = diag_pct
+     parts["diag_pts"] = (10 if diag_pct >= 80 else 7 if diag_pct >= 50
+                          else 4 if diag_pct >= 20 else 0)
+
+     verified = [r for r in sre_rows if r.get("verdict")]
+     if verified:
+         passed = [r for r in verified
+                   if isinstance(r.get("verdict"), dict) and r["verdict"].get("ok")]
+         ver_pct = 100 * len(passed) / max(1, len(verified))
+     else:
+         passed, ver_pct = [], 0
+     parts["ver_pct"] = ver_pct
+     parts["ver_pts"] = (10 if ver_pct >= 40 else 7 if ver_pct >= 25
+                         else 4 if ver_pct >= 10 else 0)
+
+     applied = [r for r in sre_rows if r.get("applied")]
+     if applied:
+         succ = [r for r in applied if r.get("outcome") == "success"]
+         apply_pct = 100 * len(succ) / max(1, len(applied))
+     else:
+         succ, apply_pct = [], 0
+     parts["apply_pct"] = apply_pct
+     parts["apply_pts"] = (10 if apply_pct >= 80 else 7 if apply_pct >= 50
+                           else 4 if apply_pct >= 30 else 0)
+
+     # Hard refusals — any HardGuard hits in queued items?
+     hard_violations = 0
+     for q in queue_dirs:
+         verdict_path = q / "verdict.json"
+         if verdict_path.exists():
+             try:
+                 v = json.loads(verdict_path.read_text())
+                 for c in v.get("checks", []):
+                     if c.get("name") == "policy" and c.get("status") == "FAIL":
+                         if "destructive" in c.get("detail", "").lower() \
+                                 or "rm -rf" in c.get("detail", "") \
+                                 or "destroy" in c.get("detail", "").lower():
+                             hard_violations += 1
+                             break
+             except: pass
+     parts["hard_violations"] = hard_violations
+     parts["safety_pts"] = 10 if hard_violations == 0 else 0
+
+     parts["sub_total"] = parts["diag_pts"] + parts["ver_pts"] + parts["apply_pts"] + parts["safety_pts"]
+     parts["axis_score"] = parts["sub_total"] * 2.5 / 4
+     parts["axis_max"] = 25
+     return parts
+
+ # ── Axis 3: Monitoring Reliability (15%) ────────────────────────────────────
+ def score_monitoring():
+     parts = {}
+     sre_log = LOGS / "autonomous-sre.log"
+     if sre_log.exists():
+         sweeps = len(re.findall(r"═══ SRE sweep ═══", sre_log.read_text()))
+         # Expected: every 5 min for ~12 hr = ~144 sweeps; ≥100 (~70%) scores 10
+         if sweeps >= 100: parts["sweep_pts"] = 10
+         elif sweeps >= 60: parts["sweep_pts"] = 7
+         elif sweeps >= 20: parts["sweep_pts"] = 4
+         else: parts["sweep_pts"] = 0 if sweeps == 0 else 2
+         parts["sweep_n"] = sweeps
+     else:
+         parts["sweep_pts"] = 0
+         parts["sweep_n"] = 0
+
+     probe_kinds = Counter(r.get("trigger", "").split(":")[0] for r in sre_rows)
+     # Reference list of the probe kinds the SRE daemon can emit:
+     expected_probes = {"hf_space_running", "hf_space_runtime_error", "hf_space_build_error",
+                        "hf_dataset_stale", "zerogpu_smoke_fail", "gh_workflow_repeated_failure"}
+     seen_kinds = sum(1 for k in probe_kinds if k)
+     if seen_kinds >= 4: parts["probe_pts"] = 10
+     elif seen_kinds >= 2: parts["probe_pts"] = 6
+     elif seen_kinds >= 1: parts["probe_pts"] = 3
+     else: parts["probe_pts"] = 0
+     parts["probe_kinds_seen"] = seen_kinds
+
+     if watchdog_killed: parts["watchdog_pts"] = 0
+     else: parts["watchdog_pts"] = 10
+     parts["watchdog_killed"] = watchdog_killed
+
+     log_int_ok = True
+     try:
+         for L in (STATE / "outcomes.jsonl").read_text().splitlines():
+             json.loads(L)
+     except FileNotFoundError:
+         log_int_ok = True   # nothing logged yet, trivially clean
+     except Exception:
+         log_int_ok = False
+     parts["log_pts"] = 10 if log_int_ok else 0
+
+     parts["sub_total"] = parts["sweep_pts"] + parts["probe_pts"] + parts["watchdog_pts"] + parts["log_pts"]
+     parts["axis_score"] = parts["sub_total"] * 1.5 / 4
+     parts["axis_max"] = 15
+     return parts
+
+ # ── Axis 4: Agentic Breadth (20%) ───────────────────────────────────────────
+ def score_breadth():
+     parts = {}
+     triggers = Counter(r.get("trigger", "?") for r in outcomes)
+     distinct = len(triggers)
+     if distinct >= 5: parts["trig_pts"] = 10
+     elif distinct >= 3: parts["trig_pts"] = 6
+     elif distinct >= 1: parts["trig_pts"] = 3
+     else: parts["trig_pts"] = 0
+     parts["distinct_triggers"] = distinct
+
+     fix_kinds = Counter()
+     for r in outcomes:
+         resp = r.get("response")
+         if isinstance(resp, dict):
+             fk = resp.get("fix_kind") or resp.get("kind")
+             if fk: fix_kinds[fk] += 1
+     distinct_kinds = len(fix_kinds)
+     if distinct_kinds >= 3: parts["kind_pts"] = 10
+     elif distinct_kinds >= 2: parts["kind_pts"] = 6
+     elif distinct_kinds >= 1: parts["kind_pts"] = 3
+     else: parts["kind_pts"] = 0
+     parts["fix_kinds"] = dict(fix_kinds)
+
+     if sft_n >= 50 and kto_n >= 100 and skills_n >= 3:
+         parts["si_pts"] = 10
+     elif sft_n >= 20 or kto_n >= 50:
+         parts["si_pts"] = 5
+     else:
+         parts["si_pts"] = 0
+     parts["sft_n"] = sft_n; parts["kto_n"] = kto_n; parts["skills_n"] = skills_n
+
+     parts["replay_storms"] = replay_storms
+     parts["idem_pts"] = 10 if replay_storms == 0 else max(0, 10 - 3*replay_storms)
+
+     parts["sub_total"] = parts["trig_pts"] + parts["kind_pts"] + parts["si_pts"] + parts["idem_pts"]
+     parts["axis_score"] = parts["sub_total"] * 2 / 4
+     parts["axis_max"] = 20
+     return parts
+
+ # ── Compute total + verdict ─────────────────────────────────────────────────
+ a1 = score_coding()
+ a2 = score_incident()
+ a3 = score_monitoring()
+ a4 = score_breadth()
+
+ total = round(a1["axis_score"] + a2["axis_score"] + a3["axis_score"] + a4["axis_score"], 1)
+
+ if total >= 80:
+     verdict = "🟢 'truly agentic' — works as advertised"
+     next_action = "✅ proceed full deployment, fire Civo v2 32B magnificent"
+     badge = "AGENTIC"
+ elif total >= 60:
+     verdict = "🟡 'works in scope, needs targeted fixes'"
+     next_action = "🔧 identify lowest sub-score, run focused training round (Branch B)"
+     badge = "PARTIAL"
+ elif total >= 40:
+     verdict = "🟠 'promise but flaky'"
+     next_action = "⚙️ Branch B with extended techniques OR data cleanup"
+     badge = "FLAKY"
+ else:
+     verdict = "🔴 'not autonomous yet'"
+     next_action = "⛔ Branch C — quality-cleanup.sh first, then re-train"
+     badge = "NOT_READY"
+
+ # ── Render report ───────────────────────────────────────────────────────────
+ bench_table_rows = []
+ if bench:
+     for label in ("v1", "base7B", "v1.1-extended", "v1.2-research"):
+         m = bench.get(label, {})
+         bench_table_rows.append((label,
+             m.get("humaneval_plus", "—"),
+             m.get("mbpp_plus", "—"),
+             m.get("lcb_v6", "—"),
+             m.get("bfcl_v3_overall", "—"),
+             m.get("ruler_16k_avg", "—"),
+             m.get("swebench_verified_lite100", "—"),
+             m.get("axentx_eval_50", "—"),
+             m.get("multi_iac_eval", "—"),
+             m.get("itbench_lite", "—"),
+         ))
+
+ md = f"""# Surrogate-1 V8 — Overnight Report
+
+ **Generated**: {now.strftime('%Y-%m-%d %H:%M UTC')}
+ **Subject**: `axentx/surrogate-1-7B-v1.2-research` on `surrogate1/surrogate-1-zero-gpu`
+ **Total score**: **{total}/100** — {verdict}
+ **Next action**: {next_action}
+
+ ---
+
+ ## Axis breakdown
+
+ | Axis | Score | Max |
+ |---|---|---|
+ | 1. Coding Capability | **{round(a1['axis_score'],1)}** | 40 |
+ | 2. Incident Response | **{round(a2['axis_score'],1)}** | 25 |
+ | 3. Monitoring Reliability | **{round(a3['axis_score'],1)}** | 15 |
+ | 4. Agentic Breadth | **{round(a4['axis_score'],1)}** | 20 |
+ | **TOTAL** | **{total}** | **100** |
+
+ ---
+
+ ## Axis 1 — Coding Capability ({round(a1['axis_score'],1)}/40)
+
+ | Sub | Score | Detail |
+ |---|---|---|
+ | Bench wins (v1.2 vs v1) | {a1['wins_pts']}/10 | {a1['wins_count']}/9 metrics improved |
+ | BFCL v3 (function calling) | {a1['bfcl_pts']}/10 | score = {a1['bfcl_score']:.1f} |
+ | In-domain (Multi-IaC + ITBench) | {a1['domain_pts']}/10 | Multi-IaC={a1['multi_iac']:.1f}, ITBench={a1['itbench']:.1f} |
+ | Arkship Phase B PRs | {a1['pr_pts']}/10 | {a1['pr_count']} draft PRs opened |
+
+ ### Bench scores (4-way)
+
+ | model | HE+ | MBPP+ | LCB v6 | BFCL | RULER 16K | SWE-B | axentx-50 | Multi-IaC | ITBench |
+ |---|---|---|---|---|---|---|---|---|---|
+ """
+ for r in bench_table_rows:
+     md += "| " + " | ".join(str(x) for x in r) + " |\n"
+ if not bench_table_rows:
+     md += "| _bench has not run yet — wait for `auto-swap-and-bench` daemon to fire_ ||||||||||\n"
+
+ md += f"""
+ ---
+
+ ## Axis 2 — Incident Response ({round(a2['axis_score'],1)}/25)
+
+ | Sub | Score | Detail |
+ |---|---|---|
+ | Diagnosis depth (fix_kind≠none, conf≥0.7) | {a2['diag_pts']}/10 | {a2['diag_pct']:.0f}% on {a2['triggered_n']} anomalies |
+ | Verifier acceptance | {a2['ver_pts']}/10 | {a2['ver_pct']:.0f}% passed 9-layer gate |
+ | Apply success (no rollback) | {a2['apply_pts']}/10 | {a2['apply_pct']:.0f}% |
+ | Hard-refuse violations queued | {a2['safety_pts']}/10 | {a2['hard_violations']} destructive patterns reached queue |
+
+ ---
+
+ ## Axis 3 — Monitoring Reliability ({round(a3['axis_score'],1)}/15)
+
+ | Sub | Score | Detail |
+ |---|---|---|
+ | Sweep cadence | {a3['sweep_pts']}/10 | {a3['sweep_n']} sweeps recorded |
+ | Probe coverage | {a3['probe_pts']}/10 | {a3['probe_kinds_seen']} distinct probe kinds saw anomalies |
+ | Watchdog clean | {a3['watchdog_pts']}/10 | killed = {a3['watchdog_killed']} |
+ | Outcome log integrity | {a3['log_pts']}/10 | parses cleanly = {a3['log_pts']==10} |
+
+ ---
+
+ ## Axis 4 — Agentic Breadth ({round(a4['axis_score'],1)}/20)
+
+ | Sub | Score | Detail |
+ |---|---|---|
+ | Distinct triggers handled | {a4['trig_pts']}/10 | {a4['distinct_triggers']} types |
+ | Fix-kind diversity | {a4['kind_pts']}/10 | kinds = {a4['fix_kinds']} |
+ | Self-improve flywheel | {a4['si_pts']}/10 | SFT={a4['sft_n']}, KTO={a4['kto_n']}, skills={a4['skills_n']} |
+ | Idempotency working | {a4['idem_pts']}/10 | replay storms = {a4['replay_storms']} |
+
+ ---
+
+ ## Owner action menu
+
+ ```bash
+ # 1. Pause everything (always safe, reversible)
+ pkill -f autonomous- ; pkill -f watchdog.sh
+
+ # 2. Inspect verifier-rejected actions awaiting your call
+ ls -la ~/.surrogate/state/queue/
+
+ # 3. See V8 vs v1 head-to-head deltas
+ cat {bench_path if bench_path else "(no-bench-yet)"}
+
+ # 4. Read full draft PRs queue (autonomous-release output)
+ gh pr list --label autonomous-release   # if GitHub reachable
+ ls ~/.surrogate/state/release-*/
+ ```
+
+ ---
+
+ ## What this scorecard means
+
+ **{badge}** = {verdict}
+
+ Cut-offs (from `~/.surrogate/state/specs/overnight-rubric.md`):
+ - 80–100 → 🟢 truly agentic, fire Civo v2
+ - 60–79 → 🟡 partial, targeted fix
+ - 40–59 → 🟠 flaky, Branch B retrain
+ - 0–39 → 🔴 not yet, Branch C cleanup
+
+ ---
+ *Generated by `~/.surrogate/hf-space/bin/v2/overnight-report.sh`*
+ """
+
+ Path(REPORT).write_text(md)
+ print(f"REPORT_PATH={REPORT}")
+ print(f"TOTAL_SCORE={total}")
+ print(f"VERDICT={verdict}")
+ print(f"BADGE={badge}")
+ PYEOF
+
+ # Extract score + verdict from report (Python already wrote it)
+ TOTAL=$(grep -oE "Total score.*\*\*[0-9.]+/100\*\*" "$REPORT" | head -1 | grep -oE "[0-9.]+" | head -1)
+ VERDICT_LINE=$(grep -oE "🟢|🟡|🟠|🔴" "$REPORT" | head -1)
+
+ # Discord ping
+ notify "🌅 Overnight Report ready (score=${TOTAL:-?}/100 ${VERDICT_LINE:-}) — see $REPORT"
+
+ echo ""
+ echo "═════════════════════════════════════════════════════════════════════"
+ echo "  📊 OVERNIGHT REPORT COMPLETE"
+ echo "  $REPORT"
+ echo "  Total: ${TOTAL:-?}/100 ${VERDICT_LINE:-}"
+ echo "═════════════════════════════════════════════════════════════════════"