Ashira Pitchayapakayakul committed on
Commit
326e0f0
·
1 Parent(s): 8b4b1b5

feat(overnight): scoring rubric + report generator + multi-window orchestrator

Browse files

Adds the wake-up evaluation pipeline so the user gets a single scored markdown
report plus a Discord ping in the morning, showing exactly how far V8 got with
autonomous monitoring, incident response, and autonomous coding.

overnight-rubric.md (~/.surrogate/state/specs/) — 4-axis weighted scoring:
• Coding Capability (40%) — bench wins 7+/9, BFCL ≥70, Multi-IaC + ITBench
  ≥55, ≥10 draft PRs on arkship Phase B
• Incident Response (25%) — diagnosis ≥80% w/ conf 0.7+, verifier ≥40%,
  apply ≥80% no rollback, 0 destructive queued
• Monitoring (15%) — sweep cadence ≥95%, probe coverage, watchdog clean
• Agentic Breadth (20%) — trigger types, fix-kind diversity, self-improve
  flywheel produced data, idempotency clean
• Verdict cutoffs: 80+ "truly agentic" / 60-79 "partial" / 40-59 "flaky" / <40 "not yet"
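The weighting above can be sketched in a few lines: each axis has four 10-point sub-checks (40 raw points) that get scaled onto the axis's share of the 100-point total. This is a minimal illustration of the arithmetic, not the shipped scorer; the helper names are hypothetical.

```python
# Rubric weights per axis (sum to 100). Each axis's raw sub-total is 0-40.
WEIGHTS = {"coding": 40, "incident": 25, "monitoring": 15, "breadth": 20}

def axis_score(raw_sub_total: float, axis_weight: float, raw_max: float = 40) -> float:
    """Scale a 0-40 raw sub-total onto the axis's share of the 100-pt total."""
    return raw_sub_total * axis_weight / raw_max

def total_score(raw: dict) -> float:
    """Weighted sum across the four axes, rounded to one decimal."""
    return round(sum(axis_score(raw[a], w) for a, w in WEIGHTS.items()), 1)
```

A perfect 40/40 on every axis yields exactly 100; halving every raw sub-total halves the total, so no axis can dominate beyond its stated weight.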

overnight-report.sh — gathers all evidence (bench summary, outcomes.jsonl,
queue/, release cycles, self-improve, watchdog state) and renders a dense
markdown report at ~/.surrogate/state/overnight-report-YYYY-MM-DD.md +
sends a Discord TLDR. Idempotent — can be re-run anytime; overwrites the file.
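The idempotency comes from a date-stamped output path, and the Discord ping is just a JSON `content` payload. A minimal sketch of both pieces (it builds the payload but does not actually post; the final `curl` line is shown commented out):

```shell
# Date-stamped report path: re-running on the same day overwrites the same file.
TODAY="$(date +%Y-%m-%d)"
REPORT="$HOME/.surrogate/state/overnight-report-$TODAY.md"

# Discord webhook payload, built via python3 so quoting/escaping is safe.
PAYLOAD="$(python3 -c "import json,sys; print(json.dumps({'content': sys.argv[1]}))" \
  "Overnight Report ready — see $REPORT")"
echo "$PAYLOAD"

# To actually send (needs DISCORD_WEBHOOK set):
#   curl -s -X POST -H "Content-Type: application/json" -d "$PAYLOAD" "$DISCORD_WEBHOOK"
```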

overnight-pipeline.sh — one-shot orchestrator that brings up the pipeline:
1. Sanity-checks that auto-swap-and-bench is alive + tokens are present
2. Starts watchdog FIRST (kill-switch armed before any risk)
3. Starts autonomous-sre.sh (baseline against v1, auto-cuts to V8 post-swap)
4. Spawns release-gate.sh — polls the bench summary; only fires
   autonomous-release.sh if V8 wins ≥3/9 metrics (skips Branch C)
5. Hourly self-improve.sh tick to grow SFT/KTO/skills datasets
6. 3 wake-up windows for the report: 07:00 / 11:00 / 17:00 Bangkok
7. release-gate also calls overnight-report on the bench-completion event
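The fixed wake-up windows boil down to a "seconds until the next wall-clock hour in a given timezone" computation. The same approach as the script's inline python3 snippet, wrapped as a hypothetical helper for clarity:

```python
import datetime
import zoneinfo

def seconds_until(hour, tz="Asia/Bangkok", now=None):
    """Seconds from `now` until the next wall-clock `hour`:00 in `tz`.

    If that time has already passed today, schedule it for tomorrow.
    """
    if now is None:
        now = datetime.datetime.now(zoneinfo.ZoneInfo(tz))
    target = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if target <= now:
        target += datetime.timedelta(days=1)
    return int((target - now).total_seconds())
```

Feeding this into `sleep` keeps the scheduling portable (no `at`/cron dependency) while still firing at local Bangkok wall-clock times.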

The report scoring and the verdict are deterministic and transparent, and the
user can rerun them manually anytime: `bash bin/v2/overnight-report.sh`.
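The verdict itself is just a threshold-band lookup on the total, so the same score always yields the same badge. A sketch of that mapping (the function name is hypothetical; the bands and badge strings follow the rubric cutoffs above):

```python
def verdict_badge(total: float) -> str:
    """Map a 0-100 rubric total onto the four verdict bands."""
    if total >= 80:
        return "AGENTIC"      # truly agentic — works as advertised
    if total >= 60:
        return "PARTIAL"      # works in scope, needs targeted fixes
    if total >= 40:
        return "FLAKY"        # promise but flaky
    return "NOT_READY"        # Branch C cleanup first
```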

bin/v2/overnight-pipeline.sh ADDED
@@ -0,0 +1,239 @@
+ #!/usr/bin/env bash
+ # Surrogate-1 — overnight pipeline orchestrator.
+ #
+ # One-shot bootstrap that brings up the eval pipeline so the user wakes up
+ # tomorrow morning to a scored report. Wires the timing carefully because
+ # V8 isn't deployed yet — we don't want autonomous-release.sh hammering v1.
+ #
+ # Timeline (T+0 = now):
+ #   T+0       watchdog + autonomous-sre come online (runs against current
+ #             v1 — that's the BASELINE measurement)
+ #   T+~8h     Kaggle V#7 finishes → adapter pushed to Hub →
+ #             auto-swap-and-bench daemon (already running, pid 94063+)
+ #             swaps surrogate1 ZeroGPU to V8 LoRA
+ #             FROM HERE ON, autonomous-sre is talking to V8
+ #   T+~16h    bench-v1-vs-v15.sh completes → post-bench-decide fires
+ #             Branch A/B/C
+ #   T+~16-18h self-improve.sh runs (cron-style, every 6 hr)
+ #             If Branch A or B → autonomous-release.sh fires for arkship
+ #   T+~22h    overnight-report.sh runs, scores, Discord ping
+ #
+ # What this script does NOW:
+ #   1. Sanity checks (V8 daemon target correct, env tokens)
+ #   2. Starts watchdog (FIRST, so kill-switch is armed before any real work)
+ #   3. Starts autonomous-sre.sh
+ #   4. Schedules a delayed autonomous-release.sh launcher (fires only if
+ #      bench summary.json shows V8 wins ≥3/9 metrics — Branch A or B gate)
+ #   5. Schedules self-improve.sh hourly tick
+ #   6. Schedules overnight-report.sh at 07:00 / 11:00 / 17:00 Bangkok local
+ #
+ # All scheduling uses background sleep loops (no dependency on `at`/cron).
+ set -uo pipefail
+ [[ -f "$HOME/.hermes/.env" ]] && { set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a; }
+
+ HFB="$HOME/.surrogate/hf-space/bin/v2"
+ STATE="$HOME/.surrogate/state"
+ LOGS="$HOME/.surrogate/logs"
+ mkdir -p "$STATE" "$LOGS"
+
+ log() { echo "[$(date '+%Y-%m-%dT%H:%M:%S')] $*" | tee -a "$LOGS/overnight-pipeline.log"; }
+ notify() {
+   [[ -z "${DISCORD_WEBHOOK:-}" ]] && return
+   curl -s -X POST -H "Content-Type: application/json" \
+     -d "$(python3 -c "import json,sys; print(json.dumps({'content': sys.argv[1]}))" "$1")" \
+     "$DISCORD_WEBHOOK" >/dev/null 2>&1 || true
+ }
+
+ log "═══ overnight-pipeline starting ═══"
+
+ # ── 1. Sanity checks ────────────────────────────────────────────────────────
+ log "── sanity checks ──"
+ if ! pgrep -f auto-swap-and-bench.sh >/dev/null; then
+   log "  ✗ auto-swap-and-bench daemon NOT running — aborting"
+   notify "❌ overnight-pipeline aborted: auto-swap-and-bench not running"
+   exit 1
+ fi
+ log "  ✓ auto-swap-and-bench alive (pid $(pgrep -f auto-swap-and-bench.sh | head -1))"
+
+ if [[ -z "${HF_TOKEN:-}" ]] || [[ -z "${DISCORD_WEBHOOK:-}" ]]; then
+   log "  ⚠ HF_TOKEN or DISCORD_WEBHOOK missing — pipeline will work but no notifications"
+ fi
+
+ # ── 2. Watchdog (FIRST, kill-switch armed before risk) ─────────────────────
+ log "── starting watchdog ──"
+ rm -f "$STATE/watchdog-killed"   # re-arm if previously tripped
+ if pgrep -f "watchdog.sh" >/dev/null; then
+   log "  watchdog already running (pid $(pgrep -f watchdog.sh | head -1)) — leaving as-is"
+ else
+   nohup bash "$HFB/watchdog.sh" > "$LOGS/watchdog.log" 2>&1 &
+   disown
+   sleep 2
+   if pgrep -f "watchdog.sh" >/dev/null; then
+     log "  ✓ watchdog started pid $(pgrep -f watchdog.sh | head -1)"
+   else
+     log "  ✗ watchdog failed to start — see $LOGS/watchdog.log"
+     exit 1
+   fi
+ fi
+
+ # ── 3. Autonomous SRE ──────────────────────────────────────────────────────
+ log "── starting autonomous-sre ──"
+ if pgrep -f "autonomous-sre.sh" >/dev/null; then
+   log "  already running (pid $(pgrep -f autonomous-sre.sh | head -1)) — leaving as-is"
+ else
+   nohup bash "$HFB/autonomous-sre.sh" > "$LOGS/autonomous-sre.log" 2>&1 &
+   disown
+   sleep 2
+   if pgrep -f "autonomous-sre.sh" >/dev/null; then
+     log "  ✓ autonomous-sre started pid $(pgrep -f autonomous-sre.sh | head -1)"
+   else
+     log "  ✗ autonomous-sre failed to start"
+   fi
+ fi
+
+ # ── 4. Delayed autonomous-release launcher (fires only after bench passes) ─
+ log "── scheduling delayed autonomous-release launcher ──"
+ cat > "$STATE/release-gate.sh" <<'GATEEOF'
+ #!/usr/bin/env bash
+ # Polls bench summary.json. Fires autonomous-release.sh only if V8 wins
+ # ≥3/9 metrics over v1 (Branch A or B threshold). Otherwise stays dormant
+ # and notifies — autonomous coding work is gated on actual capability.
+ [[ -f "$HOME/.hermes/.env" ]] && { set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a; }
+ HFB="$HOME/.surrogate/hf-space/bin/v2"
+ STATE="$HOME/.surrogate/state"
+ LOGS="$HOME/.surrogate/logs"
+
+ log() { echo "[$(date '+%Y-%m-%dT%H:%M:%S')] $*" >> "$LOGS/release-gate.log"; }
+ notify() {
+   [[ -z "${DISCORD_WEBHOOK:-}" ]] && return
+   curl -s -X POST -H "Content-Type: application/json" \
+     -d "$(python3 -c "import json,sys; print(json.dumps({'content': sys.argv[1]}))" "$1")" \
+     "$DISCORD_WEBHOOK" >/dev/null 2>&1 || true
+ }
+
+ # Poll up to MAX_HOURS for bench to complete
+ MAX_HOURS="${MAX_HOURS:-20}"
+ DEADLINE=$(( $(date +%s) + MAX_HOURS * 3600 ))
+ fired=""
+ while [[ $(date +%s) -lt $DEADLINE ]]; do
+   LATEST=$(ls -t "$HOME/.surrogate/eval/bench-v1-vs-v15-"*"/summary.json" 2>/dev/null | head -1)
+   if [[ -n "$LATEST" ]]; then
+     # Count V8 wins
+     wins=$(python3 -c "
+ import json, sys
+ try: d = json.load(open('$LATEST'))
+ except: print(0); sys.exit(0)
+ v8 = d.get('v1.2-research', {})
+ v1 = d.get('v1', {})
+ n = 0
+ for k in ('humaneval_plus','mbpp_plus','lcb_v6','bfcl_v3_overall',
+           'ruler_16k_avg','swebench_verified_lite100','axentx_eval_50',
+           'multi_iac_eval','itbench_lite'):
+     try:
+         if float(v8.get(k,'0') or 0) - float(v1.get(k,'0') or 0) > 0.5: n += 1
+     except: continue
+ print(n)
+ " 2>/dev/null || echo 0)
+     # Only proceed when bench actually populated v1.2-research column
+     has_v8=$(python3 -c "
+ import json
+ d = json.load(open('$LATEST'))
+ v8 = d.get('v1.2-research', {})
+ print(1 if any(v8.values()) else 0)
+ " 2>/dev/null)
+     if [[ "$has_v8" == "1" ]]; then
+       log "bench summary populated for v1.2-research, wins=$wins/9"
+       if (( wins >= 3 )); then
+         log "→ FIRE autonomous-release.sh (Branch A/B threshold met)"
+         notify "🚀 release-gate: V8 wins $wins/9 → starting autonomous-release on arkship"
+         nohup bash "$HFB/autonomous-release.sh" > "$LOGS/autonomous-release.log" 2>&1 &
+         disown
+       else
+         log "→ HOLD (only $wins/9 wins, threshold ≥3 — Branch C territory)"
+         notify "⛔ release-gate: V8 only won $wins/9 → autonomous-release HELD (Branch C)"
+       fi
+       fired=1
+       break   # fall through so appended event hooks still run
+     fi
+   fi
+   sleep 600   # poll every 10 min
+ done
+ if [[ -z "$fired" ]]; then
+   log "deadline reached without bench completion — exiting without firing"
+   notify "⏰ release-gate: ${MAX_HOURS}h deadline hit, bench never completed"
+ fi
+ GATEEOF
+ chmod +x "$STATE/release-gate.sh"
+ nohup bash "$STATE/release-gate.sh" >> "$LOGS/release-gate.log" 2>&1 &
+ GATE_PID=$!
+ disown
+ log "  ✓ release-gate spawned pid=$GATE_PID (polls bench, gated fire)"
+
+ # ── 5. Self-improve hourly tick ────────────────────────────────────────────
+ log "── scheduling self-improve hourly tick ──"
+ cat > "$STATE/self-improve-tick.sh" <<'TICKEOF'
+ #!/usr/bin/env bash
+ [[ -f "$HOME/.hermes/.env" ]] && { set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a; }
+ while true; do
+   bash "$HOME/.surrogate/hf-space/bin/v2/self-improve.sh" all \
+     >> "$HOME/.surrogate/logs/self-improve.log" 2>&1
+   sleep 3600
+ done
+ TICKEOF
+ chmod +x "$STATE/self-improve-tick.sh"
+ nohup bash "$STATE/self-improve-tick.sh" >> "$LOGS/self-improve.log" 2>&1 &
+ TICK_PID=$!
+ disown
+ log "  ✓ self-improve hourly tick pid=$TICK_PID"
+
+ # ── 6. Schedule overnight-report — multiple wake-up windows ────────────────
+ # V8 won't be fully benched until ~T+12-16h, so fire the report multiple
+ # times so user gets meaningful TLDR no matter when they wake up. Each fire
+ # is idempotent — overwrites today's report file with latest score.
+ log "── scheduling overnight-report (multi-fire windows) ──"
+ schedule_report() {
+   local label="$1" hour="$2"
+   local sleep_sec
+   sleep_sec=$(python3 -c "
+ import datetime, zoneinfo
+ now = datetime.datetime.now(zoneinfo.ZoneInfo('Asia/Bangkok'))
+ target = now.replace(hour=$hour, minute=0, second=0, microsecond=0)
+ if target <= now:
+     target += datetime.timedelta(days=1)
+ print(int((target - now).total_seconds()))
+ ")
+   nohup bash -c "sleep $sleep_sec; bash '$HFB/overnight-report.sh' >> '$LOGS/overnight-report.log' 2>&1" \
+     >> "$LOGS/overnight-report.log" 2>&1 &
+   local pid=$!
+   disown
+   log "  ✓ $label fire @ ${hour}:00 Bangkok pid=$pid (in $(($sleep_sec/3600))h $(($sleep_sec%3600/60))m)"
+ }
+ # 3 wake-up windows: early-morning baseline, post-bench likely, evening
+ schedule_report "early"   7
+ schedule_report "mid"     11
+ schedule_report "evening" 17
+
+ # ── 7. Wire release-gate to fire overnight-report on bench completion ──────
+ log "── adding event-triggered report fire to release-gate ──"
+ # patch release-gate to call overnight-report after its decision
+ cat >> "$STATE/release-gate.sh" <<EVTEOF
+
+ # Auto-fire report after gate decides (event-triggered, complements time-based)
+ bash "$HFB/overnight-report.sh" >> "$LOGS/overnight-report.log" 2>&1 || true
+ EVTEOF
+ log "  ✓ report fires automatically when release-gate completes"
+
+ # ── 8. Pre-flight summary ──────────────────────────────────────────────────
+ log ""
+ log "═══ overnight pipeline ARMED ═══"
+ log "  watchdog            : pid $(pgrep -f watchdog.sh | head -1)"
+ log "  autonomous-sre      : pid $(pgrep -f autonomous-sre.sh | head -1)"
+ log "  auto-swap-and-bench : pid $(pgrep -f auto-swap-and-bench.sh | head -1)"
+ log "  release-gate        : pid $GATE_PID (polls bench, gated by ≥3/9 wins)"
+ log "  self-improve tick   : pid $TICK_PID (every hour)"
+ log "  overnight-report    : fires at 07:00 / 11:00 / 17:00 Bangkok"
+ log ""
+ log "  rubric: $HOME/.surrogate/state/specs/overnight-rubric.md"
+ log "  report: $HOME/.surrogate/state/overnight-report-$(date +%Y-%m-%d).md (after fire)"
+
+ notify "🌙 overnight pipeline armed — first wakeup report at 07:00 Bangkok (rubric: 4-axis × 100pts, Discord ping when done)"
+ log "═══ done ═══"
bin/v2/overnight-report.sh ADDED
@@ -0,0 +1,471 @@
+ #!/usr/bin/env bash
+ # Surrogate-1 — overnight evaluation report generator.
+ #
+ # Reads all evidence (bench summary, outcomes log, queue, autonomous-release
+ # specs, daemon logs) and computes the 4-axis weighted score from
+ # ~/.surrogate/state/specs/overnight-rubric.md.
+ #
+ # Output: a single dense markdown report at
+ #   ~/.surrogate/state/overnight-report-YYYY-MM-DD.md
+ # + Discord ping with TLDR + verdict.
+ #
+ # Run on schedule (cron at 06:00 Bangkok = 23:00 UTC) or manually:
+ #   bash bin/v2/overnight-report.sh
+ set -uo pipefail
+ [[ -f "$HOME/.hermes/.env" ]] && { set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a; }
+
+ STATE="$HOME/.surrogate/state"
+ LOGS="$HOME/.surrogate/logs"
+ EVAL="$HOME/.surrogate/eval"
+ TODAY="$(date +%Y-%m-%d)"
+ REPORT="$STATE/overnight-report-$TODAY.md"
+ mkdir -p "$STATE"
+
+ notify() {
+   [[ -z "${DISCORD_WEBHOOK:-}" ]] && return
+   local content="$1"
+   curl -s -X POST -H "Content-Type: application/json" \
+     -d "$(python3 -c "import json,sys; print(json.dumps({'content': sys.argv[1]}))" "$content")" \
+     "$DISCORD_WEBHOOK" >/dev/null 2>&1 || true
+ }
+
+ # ── Compute everything in one Python pass for consistency ────────────────────
+ python3 - "$REPORT" <<'PYEOF'
+ import json, os, re, sys, glob, datetime as dt
+ from pathlib import Path
+ from collections import Counter
+
+ REPORT = sys.argv[1]
+ HOME = Path.home()
+ STATE = HOME / ".surrogate/state"
+ EVAL = HOME / ".surrogate/eval"
+ LOGS = HOME / ".surrogate/logs"
+
+ now = dt.datetime.now(dt.timezone.utc)
+
+ def read_lines(p):
+     try: return Path(p).read_text().splitlines()
+     except: return []
+
+ def jsonl(p):
+     rows = []
+     for L in read_lines(p):
+         try: rows.append(json.loads(L))
+         except: pass
+     return rows
+
+ # ── Evidence: bench summary (4-way × 9 evals) ────────────────────────────────
+ def latest_bench():
+     candidates = sorted(EVAL.glob("bench-v1-vs-v15-*/summary.json"),
+                         key=lambda p: p.stat().st_mtime, reverse=True)
+     return candidates[0] if candidates else None
+
+ bench_path = latest_bench()
+ bench = json.loads(bench_path.read_text()) if bench_path and bench_path.exists() else {}
+
+ def score_metric(key):
+     """Compare V8 (v1.2-research) vs v1 baseline on one metric."""
+     try:
+         v8 = float(bench.get("v1.2-research", {}).get(key, "0") or 0)
+         v1 = float(bench.get("v1", {}).get(key, "0") or 0)
+         delta = v8 - v1
+         win = delta > 0.5
+         return v8, v1, delta, win
+     except Exception:
+         return None, None, None, False
+
+ # ── Evidence: outcomes.jsonl (SRE + release daemon traces) ───────────────────
+ outcomes = jsonl(STATE / "outcomes.jsonl")
+ sre_rows = [r for r in outcomes if r.get("daemon") == "sre"]
+ release_rows = [r for r in outcomes if r.get("daemon") == "release"]
+
+ # ── Evidence: queue + draft PRs ──────────────────────────────────────────────
+ queue_dirs = list((STATE / "queue").glob("*")) if (STATE / "queue").exists() else []
+ draft_pr_count = 0
+ release_cycles = list(STATE.glob("release-*"))
+ for rc in release_cycles:
+     log = rc / "winner.json"
+     if log.exists():
+         try:
+             j = json.loads(log.read_text())
+             if j.get("target_file"):
+                 draft_pr_count += 1
+         except: pass
+
+ # ── Self-improve flywheel state ─────────────────────────────────────────────
+ si_dir = STATE / "self-improve"
+ sft_n = len(read_lines(si_dir / "sft.jsonl")) if si_dir.exists() else 0
+ kto_n = len(read_lines(si_dir / "kto.jsonl")) if si_dir.exists() else 0
+ skills_n = len(read_lines(si_dir / "skills.jsonl")) if si_dir.exists() else 0
+
+ # ── Watchdog kills ───────────────────────────────────────────────────────────
+ killed_marker = STATE / "watchdog-killed"
+ watchdog_killed = killed_marker.exists()
+
+ # ── Idempotency state (replay storms) ────────────────────────────────────────
+ idem_rows = jsonl(STATE / "idempotency.jsonl")
+ idem_keys = Counter(r.get("key") for r in idem_rows)
+ replay_storms = sum(1 for _, c in idem_keys.items() if c >= 2)
+
+ # ── Axis 1: Coding Capability (40%) ─────────────────────────────────────────
+ def score_coding():
+     parts = {}
+     if bench:
+         wins = 0
+         for key in ("humaneval_plus", "mbpp_plus", "lcb_v6", "bfcl_v3_overall",
+                     "ruler_16k_avg", "swebench_verified_lite100",
+                     "axentx_eval_50", "multi_iac_eval", "itbench_lite"):
+             _, _, _, w = score_metric(key)
+             if w: wins += 1
+         parts["wins_pts"] = (10 if wins >= 7 else 7 if wins >= 5
+                              else 4 if wins >= 3 else 1)
+         parts["wins_count"] = wins
+
+         bfcl = float(bench.get("v1.2-research", {}).get("bfcl_v3_overall", "0") or 0)
+         parts["bfcl_score"] = bfcl
+         parts["bfcl_pts"] = (10 if bfcl >= 70 else 7 if bfcl >= 60
+                              else 4 if bfcl >= 50 else 1)
+
+         miac = float(bench.get("v1.2-research", {}).get("multi_iac_eval", "0") or 0)
+         itb = float(bench.get("v1.2-research", {}).get("itbench_lite", "0") or 0)
+         if miac >= 55 and itb >= 55: parts["domain_pts"] = 10
+         elif miac >= 55 or itb >= 55: parts["domain_pts"] = 6
+         elif miac >= 40 or itb >= 40: parts["domain_pts"] = 3
+         else: parts["domain_pts"] = 1
+         parts["multi_iac"] = miac
+         parts["itbench"] = itb
+     else:
+         parts["wins_pts"] = parts["bfcl_pts"] = parts["domain_pts"] = 0
+         parts["wins_count"] = parts["bfcl_score"] = parts["multi_iac"] = parts["itbench"] = 0
+
+     if draft_pr_count >= 10: parts["pr_pts"] = 10
+     elif draft_pr_count >= 5: parts["pr_pts"] = 7
+     elif draft_pr_count >= 1: parts["pr_pts"] = 4
+     else: parts["pr_pts"] = 0
+     parts["pr_count"] = draft_pr_count
+
+     parts["sub_total"] = parts["wins_pts"] + parts["bfcl_pts"] + parts["domain_pts"] + parts["pr_pts"]
+     parts["axis_score"] = parts["sub_total"] * 4 / 4   # ×4/4 = weight 1; kept for symmetry with other axes
+     parts["axis_max"] = 40
+     return parts
+
+ # ── Axis 2: Incident Response (25%) ─────────────────────────────────────────
+ def score_incident():
+     parts = {}
+     triggered = [r for r in sre_rows if r.get("response")]
+     parts["triggered_n"] = len(triggered)
+     if triggered:
+         with_fix = [r for r in triggered
+                     if isinstance(r.get("response"), dict)
+                     and r["response"].get("fix_kind", "none") != "none"
+                     and float(r["response"].get("confidence", 0)) >= 0.7]
+         diag_pct = 100 * len(with_fix) / max(1, len(triggered))
+     else:
+         with_fix, diag_pct = [], 0
+     parts["diag_pct"] = diag_pct
+     parts["diag_pts"] = (10 if diag_pct >= 80 else 7 if diag_pct >= 50
+                          else 4 if diag_pct >= 20 else 0)
+
+     verified = [r for r in sre_rows if r.get("verdict")]
+     if verified:
+         passed = [r for r in verified
+                   if isinstance(r.get("verdict"), dict) and r["verdict"].get("ok")]
+         ver_pct = 100 * len(passed) / max(1, len(verified))
+     else:
+         passed, ver_pct = [], 0
+     parts["ver_pct"] = ver_pct
+     parts["ver_pts"] = (10 if ver_pct >= 40 else 7 if ver_pct >= 25
+                         else 4 if ver_pct >= 10 else 0)
+
+     applied = [r for r in sre_rows if r.get("applied")]
+     if applied:
+         succ = [r for r in applied if r.get("outcome") == "success"]
+         apply_pct = 100 * len(succ) / max(1, len(applied))
+     else:
+         succ, apply_pct = [], 0
+     parts["apply_pct"] = apply_pct
+     parts["apply_pts"] = (10 if apply_pct >= 80 else 7 if apply_pct >= 50
+                           else 4 if apply_pct >= 30 else 0)
+
+     # Hard refusals — any HardGuard hits in queued items?
+     hard_violations = 0
+     for q in queue_dirs:
+         verdict_path = q / "verdict.json"
+         if verdict_path.exists():
+             try:
+                 v = json.loads(verdict_path.read_text())
+                 for c in v.get("checks", []):
+                     if c.get("name") == "policy" and c.get("status") == "FAIL":
+                         if "destructive" in c.get("detail", "").lower() \
+                                 or "rm -rf" in c.get("detail", "") \
+                                 or "destroy" in c.get("detail", "").lower():
+                             hard_violations += 1
+                             break
+             except: pass
+     parts["hard_violations"] = hard_violations
+     parts["safety_pts"] = 10 if hard_violations == 0 else 0
+
+     parts["sub_total"] = parts["diag_pts"] + parts["ver_pts"] + parts["apply_pts"] + parts["safety_pts"]
+     parts["axis_score"] = parts["sub_total"] * 2.5 / 4
+     parts["axis_max"] = 25
+     return parts
+
+ # ── Axis 3: Monitoring Reliability (15%) ────────────────────────────────────
+ def score_monitoring():
+     parts = {}
+     sre_log = LOGS / "autonomous-sre.log"
+     if sre_log.exists():
+         sweeps = len(re.findall(r"═══ SRE sweep ═══", sre_log.read_text()))
+         # Expected: every 5 min for ~12 hr = ~144 sweeps; ≥100 (~70%) scores 10
+         if sweeps >= 100: parts["sweep_pts"] = 10
+         elif sweeps >= 60: parts["sweep_pts"] = 7
+         elif sweeps >= 20: parts["sweep_pts"] = 4
+         else: parts["sweep_pts"] = 0 if sweeps == 0 else 2
+         parts["sweep_n"] = sweeps
+     else:
+         parts["sweep_pts"] = 0
+         parts["sweep_n"] = 0
+
+     probe_kinds = Counter(r.get("trigger", "").split(":")[0] for r in sre_rows)
+     # Reference list of the probe kinds the SRE daemon can emit:
+     expected_probes = {"hf_space_running", "hf_space_runtime_error", "hf_space_build_error",
+                        "hf_dataset_stale", "zerogpu_smoke_fail", "gh_workflow_repeated_failure"}
+     seen_kinds = sum(1 for k in probe_kinds if k)
+     if seen_kinds >= 4: parts["probe_pts"] = 10
+     elif seen_kinds >= 2: parts["probe_pts"] = 6
+     elif seen_kinds >= 1: parts["probe_pts"] = 3
+     else: parts["probe_pts"] = 0
+     parts["probe_kinds_seen"] = seen_kinds
+
+     if watchdog_killed: parts["watchdog_pts"] = 0
+     else: parts["watchdog_pts"] = 10
+     parts["watchdog_killed"] = watchdog_killed
+
+     log_int_ok = True
+     try:
+         for L in (STATE / "outcomes.jsonl").read_text().splitlines():
+             json.loads(L)
+     except FileNotFoundError:
+         log_int_ok = True   # nothing logged yet, trivially clean
+     except Exception:
+         log_int_ok = False
+     parts["log_pts"] = 10 if log_int_ok else 0
+
+     parts["sub_total"] = parts["sweep_pts"] + parts["probe_pts"] + parts["watchdog_pts"] + parts["log_pts"]
+     parts["axis_score"] = parts["sub_total"] * 1.5 / 4
+     parts["axis_max"] = 15
+     return parts
+
+ # ── Axis 4: Agentic Breadth (20%) ───────────────────────────────────────────
+ def score_breadth():
+     parts = {}
+     triggers = Counter(r.get("trigger", "?") for r in outcomes)
+     distinct = len(triggers)
+     if distinct >= 5: parts["trig_pts"] = 10
+     elif distinct >= 3: parts["trig_pts"] = 6
+     elif distinct >= 1: parts["trig_pts"] = 3
+     else: parts["trig_pts"] = 0
+     parts["distinct_triggers"] = distinct
+
+     fix_kinds = Counter()
+     for r in outcomes:
+         resp = r.get("response")
+         if isinstance(resp, dict):
+             fk = resp.get("fix_kind") or resp.get("kind")
+             if fk: fix_kinds[fk] += 1
+     distinct_kinds = len(fix_kinds)
+     if distinct_kinds >= 3: parts["kind_pts"] = 10
+     elif distinct_kinds >= 2: parts["kind_pts"] = 6
+     elif distinct_kinds >= 1: parts["kind_pts"] = 3
+     else: parts["kind_pts"] = 0
+     parts["fix_kinds"] = dict(fix_kinds)
+
+     if sft_n >= 50 and kto_n >= 100 and skills_n >= 3:
+         parts["si_pts"] = 10
+     elif sft_n >= 20 or kto_n >= 50:
+         parts["si_pts"] = 5
+     else:
+         parts["si_pts"] = 0
+     parts["sft_n"] = sft_n; parts["kto_n"] = kto_n; parts["skills_n"] = skills_n
+
+     parts["replay_storms"] = replay_storms
+     parts["idem_pts"] = 10 if replay_storms == 0 else max(0, 10 - 3*replay_storms)
+
+     parts["sub_total"] = parts["trig_pts"] + parts["kind_pts"] + parts["si_pts"] + parts["idem_pts"]
+     parts["axis_score"] = parts["sub_total"] * 2 / 4
+     parts["axis_max"] = 20
+     return parts
+
+ # ── Compute total + verdict ─────────────────────────────────────────────────
+ a1 = score_coding()
+ a2 = score_incident()
+ a3 = score_monitoring()
+ a4 = score_breadth()
+
+ total = round(a1["axis_score"] + a2["axis_score"] + a3["axis_score"] + a4["axis_score"], 1)
+
+ if total >= 80:
+     verdict = "🟢 'truly agentic' — works as advertised"
+     next_action = "✅ proceed full deployment, fire Civo v2 32B magnificent"
+     badge = "AGENTIC"
+ elif total >= 60:
+     verdict = "🟡 'works in scope, needs targeted fixes'"
+     next_action = "🔧 identify lowest sub-score, run focused training round (Branch B)"
+     badge = "PARTIAL"
+ elif total >= 40:
+     verdict = "🟠 'promise but flaky'"
+     next_action = "⚙️ Branch B with extended techniques OR data cleanup"
+     badge = "FLAKY"
+ else:
+     verdict = "🔴 'not autonomous yet'"
+     next_action = "⛔ Branch C — quality-cleanup.sh first, then re-train"
+     badge = "NOT_READY"
+
+ # ── Render report ───────────────────────────────────────────────────────────
+ bench_table_rows = []
+ if bench:
+     for label in ("v1", "base7B", "v1.1-extended", "v1.2-research"):
+         m = bench.get(label, {})
+         bench_table_rows.append((label,
+             m.get("humaneval_plus", "—"),
+             m.get("mbpp_plus", "—"),
+             m.get("lcb_v6", "—"),
+             m.get("bfcl_v3_overall", "—"),
+             m.get("ruler_16k_avg", "—"),
+             m.get("swebench_verified_lite100", "—"),
+             m.get("axentx_eval_50", "—"),
+             m.get("multi_iac_eval", "—"),
+             m.get("itbench_lite", "—"),
+         ))
+
+ md = f"""# Surrogate-1 V8 — Overnight Report
+
+ **Generated**: {now.strftime('%Y-%m-%d %H:%M UTC')}
+ **Subject**: `axentx/surrogate-1-7B-v1.2-research` on `surrogate1/surrogate-1-zero-gpu`
+ **Total score**: **{total}/100** — {verdict}
+ **Next action**: {next_action}
+
+ ---
+
+ ## Axis breakdown
+
+ | Axis | Score | Max |
+ |---|---|---|
+ | 1. Coding Capability | **{round(a1['axis_score'],1)}** | 40 |
+ | 2. Incident Response | **{round(a2['axis_score'],1)}** | 25 |
+ | 3. Monitoring Reliability | **{round(a3['axis_score'],1)}** | 15 |
+ | 4. Agentic Breadth | **{round(a4['axis_score'],1)}** | 20 |
+ | **TOTAL** | **{total}** | **100** |
+
+ ---
+
+ ## Axis 1 — Coding Capability ({round(a1['axis_score'],1)}/40)
+
+ | Sub | Score | Detail |
+ |---|---|---|
+ | Bench wins (v1.2 vs v1) | {a1['wins_pts']}/10 | {a1['wins_count']}/9 metrics improved |
+ | BFCL v3 (function calling) | {a1['bfcl_pts']}/10 | score = {a1['bfcl_score']:.1f} |
+ | In-domain (Multi-IaC + ITBench) | {a1['domain_pts']}/10 | Multi-IaC={a1['multi_iac']:.1f}, ITBench={a1['itbench']:.1f} |
+ | Arkship Phase B PRs | {a1['pr_pts']}/10 | {a1['pr_count']} draft PRs opened |
+
+ ### Bench scores (4-way)
+
+ | model | HE+ | MBPP+ | LCB v6 | BFCL | RULER 16K | SWE-B | axentx-50 | Multi-IaC | ITBench |
+ |---|---|---|---|---|---|---|---|---|---|
+ """
+ for r in bench_table_rows:
+     md += "| " + " | ".join(str(x) for x in r) + " |\n"
+ if not bench_table_rows:
+     md += "| _bench has not run yet — wait for `auto-swap-and-bench` daemon to fire_ ||||||||||\n"
+
+ md += f"""
+ ---
+
+ ## Axis 2 — Incident Response ({round(a2['axis_score'],1)}/25)
+
+ | Sub | Score | Detail |
+ |---|---|---|
+ | Diagnosis depth (fix_kind≠none, conf≥0.7) | {a2['diag_pts']}/10 | {a2['diag_pct']:.0f}% on {a2['triggered_n']} anomalies |
+ | Verifier acceptance | {a2['ver_pts']}/10 | {a2['ver_pct']:.0f}% passed 9-layer gate |
+ | Apply success (no rollback) | {a2['apply_pts']}/10 | {a2['apply_pct']:.0f}% |
+ | Hard-refuse violations queued | {a2['safety_pts']}/10 | {a2['hard_violations']} destructive patterns reached queue |
+
+ ---
+
+ ## Axis 3 — Monitoring Reliability ({round(a3['axis_score'],1)}/15)
+
+ | Sub | Score | Detail |
+ |---|---|---|
+ | Sweep cadence | {a3['sweep_pts']}/10 | {a3['sweep_n']} sweeps recorded |
+ | Probe coverage | {a3['probe_pts']}/10 | {a3['probe_kinds_seen']} distinct probe kinds saw anomalies |
+ | Watchdog clean | {a3['watchdog_pts']}/10 | killed = {a3['watchdog_killed']} |
+ | Outcome log integrity | {a3['log_pts']}/10 | parses cleanly = {a3['log_pts']==10} |
+
+ ---
+
+ ## Axis 4 — Agentic Breadth ({round(a4['axis_score'],1)}/20)
+
+ | Sub | Score | Detail |
+ |---|---|---|
+ | Distinct triggers handled | {a4['trig_pts']}/10 | {a4['distinct_triggers']} types |
+ | Fix-kind diversity | {a4['kind_pts']}/10 | kinds = {a4['fix_kinds']} |
+ | Self-improve flywheel | {a4['si_pts']}/10 | SFT={a4['sft_n']}, KTO={a4['kto_n']}, skills={a4['skills_n']} |
+ | Idempotency working | {a4['idem_pts']}/10 | replay storms = {a4['replay_storms']} |
+
+ ---
+
+ ## Owner action menu
+
+ ```bash
+ # 1. Pause everything (always safe, reversible)
+ pkill -f autonomous- ; pkill -f watchdog.sh
+
+ # 2. Inspect verifier-rejected actions awaiting your call
+ ls -la ~/.surrogate/state/queue/
+
+ # 3. See V8 vs v1 head-to-head deltas
+ cat {bench_path if bench_path else "(no-bench-yet)"}
+
+ # 4. Read full draft PRs queue (autonomous-release output)
+ gh pr list --label autonomous-release   # if GitHub reachable
+ ls ~/.surrogate/state/release-*/
+ ```
+
+ ---
+
+ ## What this scorecard means
+
+ **{badge}** = {verdict}
+
+ Cut-offs (from `~/.surrogate/state/specs/overnight-rubric.md`):
+ - 80–100 → 🟢 truly agentic, fire Civo v2
+ - 60–79 → 🟡 partial, targeted fix
+ - 40–59 → 🟠 flaky, Branch B retrain
+ - 0–39 → 🔴 not yet, Branch C cleanup
+
+ ---
+ *Generated by `~/.surrogate/hf-space/bin/v2/overnight-report.sh`*
+ """
+
+ Path(REPORT).write_text(md)
+ print(f"REPORT_PATH={REPORT}")
+ print(f"TOTAL_SCORE={total}")
+ print(f"VERDICT={verdict}")
+ print(f"BADGE={badge}")
+ PYEOF
+
+ # Extract score + verdict from report (Python already wrote it)
+ TOTAL=$(grep -oE "Total score.*\*\*[0-9.]+/100\*\*" "$REPORT" | head -1 | grep -oE "[0-9.]+" | head -1)
+ VERDICT_LINE=$(grep -oE "🟢|🟡|🟠|🔴" "$REPORT" | head -1)
+
+ # Discord ping
+ notify "🌅 Overnight Report ready (score=${TOTAL:-?}/100 ${VERDICT_LINE:-}) — see $REPORT"
+
+ echo ""
+ echo "═════════════════════════════════════════════════════════════════════"
+ echo "  📊 OVERNIGHT REPORT COMPLETE"
+ echo "  $REPORT"
+ echo "  Total: ${TOTAL:-?}/100 ${VERDICT_LINE:-}"
+ echo "═════════════════════════════════════════════════════════════════════"