feat(overnight): scoring rubric + report generator + multi-window orchestrator
Adds the wake-up evaluation pipeline so the user gets a single scored markdown
report + Discord ping in the morning showing exactly how far V8 got with
autonomous monitoring + incident response + autonomous coding.
overnight-rubric.md (~/.surrogate/state/specs/) - 4-axis weighted scoring:
• Coding Capability (40%) - bench wins 7+/9, BFCL ≥70, Multi-IaC+ITBench
≥55, ≥10 draft PRs on arkship Phase B
• Incident Response (25%) - diagnosis ≥80% w/ conf 0.7+, verifier ≥40%,
apply ≥80% no rollback, 0 destructive queued
• Monitoring (15%) - sweep cadence ≥95%, probe coverage, watchdog clean
• Agentic Breadth (20%) - trigger types, fix-kind diversity, self-improve
flywheel produced data, idempotency clean
• Verdict cutoffs: 80+ "agentic for real" / 60-79 "partial" / 40-59 "flaky" / <40 "not yet"
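The weighting above can be sketched in a few lines. This is a minimal illustration, assuming (as the report script in this commit does) that each axis is built from four sub-scores worth 0-10 points each; the names `AXIS_WEIGHTS`, `axis_score`, `total_score`, and `verdict` are hypothetical and are not taken from the rubric file.

```python
# Hypothetical sketch of the 4-axis weighted rubric.
# Assumption: every axis has four sub-scores, each in the range 0-10.
AXIS_WEIGHTS = {"coding": 40, "incident": 25, "monitoring": 15, "breadth": 20}


def axis_score(sub_scores, weight):
    """Scale four 0-10 sub-scores (40 raw points max) to `weight` points."""
    return sum(sub_scores) * weight / 40


def total_score(sub_scores_by_axis):
    """Sum the weighted axis scores; a perfect run totals 100."""
    return round(
        sum(axis_score(subs, AXIS_WEIGHTS[axis])
            for axis, subs in sub_scores_by_axis.items()),
        1,
    )


def verdict(total):
    """Map a 0-100 total onto the four verdict bands from the rubric."""
    if total >= 80:
        return "agentic"
    if total >= 60:
        return "partial"
    if total >= 40:
        return "flaky"
    return "not yet"


example = {
    "coding": [10, 7, 6, 4],
    "incident": [7, 10, 7, 10],
    "monitoring": [10, 6, 10, 10],
    "breadth": [6, 6, 5, 10],
}
print(total_score(example), verdict(total_score(example)))
```

Because the weights sum to 100 and each axis is scaled from the same 40-point raw range, a weak axis can only cost as many points as its weight.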
overnight-report.sh - gathers all evidence (bench summary, outcomes.jsonl,
queue/, release cycles, self-improve, watchdog state) and renders a dense
markdown report at ~/.surrogate/state/overnight-report-YYYY-MM-DD.md +
sends Discord TLDR. Idempotent: can be re-run anytime, overwrites file.
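The idempotency claim comes down to a date-keyed path plus an overwriting write. A minimal sketch of that idea follows; the helper names `report_path` and `write_report` are assumptions for illustration and do not appear in the script.

```python
# Hypothetical sketch: one report file per day, overwritten on every run.
import datetime as dt
from pathlib import Path


def report_path(state_dir, day):
    """Build the per-day report path, e.g. overnight-report-2025-01-02.md."""
    return Path(state_dir) / f"overnight-report-{day:%Y-%m-%d}.md"


def write_report(state_dir, day, markdown):
    """Write the report, replacing any earlier report for the same day."""
    path = report_path(state_dir, day)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(markdown, encoding="utf-8")  # overwrite, never append
    return path
```

Re-running on the same day therefore leaves exactly one file holding the latest score, while a run after midnight starts a new file.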
overnight-pipeline.sh - one-shot orchestrator that brings up the pipeline:
1. Sanity-checks auto-swap-and-bench is alive + tokens present
2. Starts watchdog FIRST (kill-switch armed before risk)
3. Starts autonomous-sre.sh (baseline against v1, auto-cuts to V8 post-swap)
4. Spawns release-gate.sh, which polls bench summary; only fires
autonomous-release.sh if V8 wins ≥3/9 metrics (skips Branch C)
5. Hourly self-improve.sh tick to grow SFT/KTO/skills datasets
6. 3 wake-up windows for the report: 07:00 / 11:00 / 17:00 Bangkok
7. release-gate also calls overnight-report on bench-completion event
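The gate in step 4 can be sketched as a pure function over the bench summary. The metric keys, the `v1` and `v1.2-research` column names, the 0.5-point margin, and the 3-win threshold are taken from the scripts in this commit; the function names `count_wins` and `gate_decision` and the `"wait"` / `"fire"` / `"hold"` labels are assumptions made for this illustration.

```python
# Hypothetical sketch of the release gate: count metrics where the candidate
# beats the baseline by more than a margin, then compare with a threshold.
METRICS = (
    "humaneval_plus", "mbpp_plus", "lcb_v6", "bfcl_v3_overall",
    "ruler_16k_avg", "swebench_verified_lite100", "axentx_eval_50",
    "multi_iac_eval", "itbench_lite",
)


def count_wins(summary, candidate="v1.2-research", baseline="v1", margin=0.5):
    """Return how many metrics the candidate wins by more than `margin`."""
    cand = summary.get(candidate, {})
    base = summary.get(baseline, {})
    wins = 0
    for key in METRICS:
        try:
            delta = float(cand.get(key) or 0) - float(base.get(key) or 0)
        except (TypeError, ValueError):
            continue  # unparsable score: treat as no win
        if delta > margin:
            wins += 1
    return wins


def gate_decision(summary, threshold=3):
    """'wait' until the candidate column is populated, then 'fire' or 'hold'."""
    if not any(summary.get("v1.2-research", {}).values()):
        return "wait"
    return "fire" if count_wins(summary) >= threshold else "hold"
```

Keeping the decision separate from the polling loop makes the threshold easy to check against a saved `summary.json` without starting any daemon.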
The report scoring + the verdict are deterministic, transparent, and the
user can rerun manually anytime: `bash bin/v2/overnight-report.sh`.
- bin/v2/overnight-pipeline.sh +239 -0
- bin/v2/overnight-report.sh +471 -0
@@ -0,0 +1,239 @@
#!/usr/bin/env bash
# Surrogate-1: overnight pipeline orchestrator.
#
# One-shot bootstrap that brings up the eval pipeline so the user wakes up
# tomorrow morning to a scored report. Wires the timing carefully because
# V8 isn't deployed yet and we don't want autonomous-release.sh hammering v1.
#
# Timeline (T+0 = now):
#   T+0        watchdog + autonomous-sre come online (runs against current
#              v1; that's the BASELINE measurement)
#   T+~8h      Kaggle V#7 finishes → adapter pushed to Hub →
#              auto-swap-and-bench daemon (already running, pid 94063+)
#              swaps surrogate1 ZeroGPU to V8 LoRA
#              FROM HERE ON, autonomous-sre is talking to V8
#   T+~16h     bench-v1-vs-v15.sh completes → post-bench-decide fires
#              Branch A/B/C
#   T+~16-18h  self-improve.sh runs (background loop, every hour)
#              If Branch A or B → autonomous-release.sh fires for arkship
#   T+~22h     overnight-report.sh runs, scores, Discord ping
#
# What this script does NOW:
#   1. Sanity checks (V8 daemon target correct, env tokens)
#   2. Starts watchdog (FIRST, so kill-switch is armed before any real work)
#   3. Starts autonomous-sre.sh
#   4. Schedules a delayed autonomous-release.sh launcher (fires only if
#      bench summary.json shows V8 wins ≥3/9 metrics: Branch A or B gate)
#   5. Schedules self-improve.sh hourly tick
#   6. Schedules overnight-report.sh at 07:00 / 11:00 / 17:00 Bangkok local
#
# All scheduling uses background sleep loops (nohup + disown).
set -uo pipefail
[[ -f "$HOME/.hermes/.env" ]] && { set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a; }

HFB="$HOME/.surrogate/hf-space/bin/v2"
STATE="$HOME/.surrogate/state"
LOGS="$HOME/.surrogate/logs"
mkdir -p "$STATE" "$LOGS"

log() { echo "[$(date '+%Y-%m-%dT%H:%M:%S')] $*" | tee -a "$LOGS/overnight-pipeline.log"; }
notify() {
  [[ -z "${DISCORD_WEBHOOK:-}" ]] && return
  curl -s -X POST -H "Content-Type: application/json" \
    -d "$(python3 -c "import json,sys; print(json.dumps({'content': sys.argv[1]}))" "$1")" \
    "$DISCORD_WEBHOOK" >/dev/null 2>&1 || true
}

log "═══ overnight-pipeline starting ═══"

# ── 1. Sanity checks ────────────────────────────────────────────────────────
log "── sanity checks ──"
if ! pgrep -f auto-swap-and-bench.sh >/dev/null; then
  log "  ✗ auto-swap-and-bench daemon NOT running, aborting"
  notify "✗ overnight-pipeline aborted: auto-swap-and-bench not running"
  exit 1
fi
log "  ✓ auto-swap-and-bench alive (pid $(pgrep -f auto-swap-and-bench.sh | head -1))"

if [[ -z "${HF_TOKEN:-}" ]] || [[ -z "${DISCORD_WEBHOOK:-}" ]]; then
  log "  ⚠ HF_TOKEN or DISCORD_WEBHOOK missing; pipeline will work but no notifications"
fi

# ── 2. Watchdog (FIRST, kill-switch armed before risk) ──────────────────────
log "── starting watchdog ──"
rm -f "$STATE/watchdog-killed"   # re-arm if previously tripped
if pgrep -f "watchdog.sh" >/dev/null; then
  log "  watchdog already running (pid $(pgrep -f watchdog.sh | head -1)), leaving as-is"
else
  nohup bash "$HFB/watchdog.sh" > "$LOGS/watchdog.log" 2>&1 &
  disown
  sleep 2
  if pgrep -f "watchdog.sh" >/dev/null; then
    log "  ✓ watchdog started pid $(pgrep -f watchdog.sh | head -1)"
  else
    log "  ✗ watchdog failed to start, see $LOGS/watchdog.log"
    exit 1
  fi
fi

# ── 3. Autonomous SRE ───────────────────────────────────────────────────────
log "── starting autonomous-sre ──"
if pgrep -f "autonomous-sre.sh" >/dev/null; then
  log "  already running (pid $(pgrep -f autonomous-sre.sh | head -1)), leaving as-is"
else
  nohup bash "$HFB/autonomous-sre.sh" > "$LOGS/autonomous-sre.log" 2>&1 &
  disown
  sleep 2
  if pgrep -f "autonomous-sre.sh" >/dev/null; then
    log "  ✓ autonomous-sre started pid $(pgrep -f autonomous-sre.sh | head -1)"
  else
    log "  ✗ autonomous-sre failed to start"
  fi
fi

# ── 4. Delayed autonomous-release launcher (fires only after bench passes) ──
log "── scheduling delayed autonomous-release launcher ──"
cat > "$STATE/release-gate.sh" <<'GATEEOF'
#!/usr/bin/env bash
# Polls bench summary.json. Fires autonomous-release.sh only if V8 wins
# ≥3/9 metrics over v1 (Branch A or B threshold). Otherwise stays dormant
# and notifies: autonomous coding work is gated on actual capability.
[[ -f "$HOME/.hermes/.env" ]] && { set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a; }
HFB="$HOME/.surrogate/hf-space/bin/v2"
STATE="$HOME/.surrogate/state"
LOGS="$HOME/.surrogate/logs"

log() { echo "[$(date '+%Y-%m-%dT%H:%M:%S')] $*" >> "$LOGS/release-gate.log"; }
notify() {
  [[ -z "${DISCORD_WEBHOOK:-}" ]] && return
  curl -s -X POST -H "Content-Type: application/json" \
    -d "$(python3 -c "import json,sys; print(json.dumps({'content': sys.argv[1]}))" "$1")" \
    "$DISCORD_WEBHOOK" >/dev/null 2>&1 || true
}

# Poll up to MAX_HOURS for bench to complete
MAX_HOURS="${MAX_HOURS:-20}"
DEADLINE=$(( $(date +%s) + MAX_HOURS * 3600 ))
decided=0
while [[ $(date +%s) -lt $DEADLINE ]]; do
  LATEST=$(ls -t "$HOME/.surrogate/eval/bench-v1-vs-v15-"*"/summary.json" 2>/dev/null | head -1)
  if [[ -n "$LATEST" ]]; then
    # Count V8 wins
    wins=$(python3 -c "
import json, sys
try: d = json.load(open('$LATEST'))
except Exception: print(0); sys.exit(0)
v8 = d.get('v1.2-research', {})
v1 = d.get('v1', {})
n = 0
for k in ('humaneval_plus','mbpp_plus','lcb_v6','bfcl_v3_overall',
          'ruler_16k_avg','swebench_verified_lite100','axentx_eval_50',
          'multi_iac_eval','itbench_lite'):
    try:
        if float(v8.get(k,'0') or 0) - float(v1.get(k,'0') or 0) > 0.5: n += 1
    except Exception: continue
print(n)
" 2>/dev/null || echo 0)
    # Only proceed when bench actually populated v1.2-research column
    has_v8=$(python3 -c "
import json
d = json.load(open('$LATEST'))
v8 = d.get('v1.2-research', {})
print(1 if any(v8.values()) else 0)
" 2>/dev/null)
    if [[ "$has_v8" == "1" ]]; then
      log "bench summary populated for v1.2-research, wins=$wins/9"
      if (( wins >= 3 )); then
        log "✓ FIRE autonomous-release.sh (Branch A/B threshold met)"
        notify "release-gate: V8 wins $wins/9, starting autonomous-release on arkship"
        nohup bash "$HFB/autonomous-release.sh" > "$LOGS/autonomous-release.log" 2>&1 &
        disown
      else
        log "✗ HOLD (only $wins/9 wins, threshold ≥3; Branch C territory)"
        notify "✗ release-gate: V8 only won $wins/9, autonomous-release HELD (Branch C)"
      fi
      # Leave the loop instead of exiting so the report call below still runs.
      decided=1
      break
    fi
  fi
  sleep 600   # poll every 10 min
done
if (( ! decided )); then
  log "deadline reached without bench completion, exiting without firing"
  notify "⏰ release-gate: ${MAX_HOURS}h deadline hit, bench never completed"
fi

# Auto-fire report after gate decides (event-triggered, complements time-based)
bash "$HFB/overnight-report.sh" >> "$LOGS/overnight-report.log" 2>&1 || true
GATEEOF
chmod +x "$STATE/release-gate.sh"
nohup bash "$STATE/release-gate.sh" >> "$LOGS/release-gate.log" 2>&1 &
GATE_PID=$!
disown
log "  ✓ release-gate spawned pid=$GATE_PID (polls bench, gated fire)"

# ── 5. Self-improve hourly tick ─────────────────────────────────────────────
log "── scheduling self-improve hourly tick ──"
cat > "$STATE/self-improve-tick.sh" <<'TICKEOF'
#!/usr/bin/env bash
[[ -f "$HOME/.hermes/.env" ]] && { set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a; }
while true; do
  bash "$HOME/.surrogate/hf-space/bin/v2/self-improve.sh" all \
    >> "$HOME/.surrogate/logs/self-improve.log" 2>&1
  sleep 3600
done
TICKEOF
chmod +x "$STATE/self-improve-tick.sh"
nohup bash "$STATE/self-improve-tick.sh" >> "$LOGS/self-improve.log" 2>&1 &
TICK_PID=$!
disown
log "  ✓ self-improve hourly tick pid=$TICK_PID"

# ── 6. Schedule overnight-report: multiple wake-up windows ──────────────────
# V8 won't be fully benched until ~T+12-16h, so fire the report multiple
# times so user gets meaningful TLDR no matter when they wake up. Each fire
# is idempotent: it overwrites today's report file with the latest score.
log "── scheduling overnight-report (multi-fire windows) ──"
schedule_report() {
  local label="$1" hour="$2"
  local sleep_sec
  sleep_sec=$(python3 -c "
import datetime, zoneinfo
now = datetime.datetime.now(zoneinfo.ZoneInfo('Asia/Bangkok'))
target = now.replace(hour=$hour, minute=0, second=0, microsecond=0)
if target <= now:
    target += datetime.timedelta(days=1)
print(int((target - now).total_seconds()))
")
  nohup bash -c "sleep $sleep_sec; bash '$HFB/overnight-report.sh' >> '$LOGS/overnight-report.log' 2>&1" \
    >> "$LOGS/overnight-report.log" 2>&1 &
  local pid=$!
  disown
  log "  ✓ $label fire @ ${hour}:00 Bangkok pid=$pid (in $((sleep_sec/3600))h $((sleep_sec%3600/60))m)"
}
# 3 wake-up windows: early-morning baseline, post-bench likely, evening
schedule_report "early" 7
schedule_report "mid" 11
schedule_report "evening" 17

# ── 7. Event-triggered report fire on bench completion ──────────────────────
# release-gate.sh (written in step 4) ends by calling overnight-report.sh, so
# the report also fires as soon as the gate decides or hits its deadline.
log "── event-triggered report fire wired into release-gate ──"
log "  ✓ report fires automatically when release-gate completes"

# ── 8. Pre-flight summary ───────────────────────────────────────────────────
log ""
log "═══ overnight pipeline ARMED ═══"
log "  watchdog            : pid $(pgrep -f watchdog.sh | head -1)"
log "  autonomous-sre      : pid $(pgrep -f autonomous-sre.sh | head -1)"
log "  auto-swap-and-bench : pid $(pgrep -f auto-swap-and-bench.sh | head -1)"
log "  release-gate        : pid $GATE_PID (polls bench, gated by ≥3/9 wins)"
log "  self-improve tick   : pid $TICK_PID (every hour)"
# The report timers are background "sleep; bash overnight-report.sh" shells,
# so look them up by command line rather than via a single saved pid.
log "  overnight-report    : pids $(pgrep -f 'overnight-report\.sh' | paste -sd' ' -) (fires at 07:00 / 11:00 / 17:00 Bangkok)"
log ""
log "  rubric: $HOME/.surrogate/state/specs/overnight-rubric.md"
log "  report: $HOME/.surrogate/state/overnight-report-$(date +%Y-%m-%d).md (after fire)"

notify "overnight pipeline armed: wakeup reports at 07:00 / 11:00 / 17:00 Bangkok (rubric: 4-axis × 100pts, Discord ping when done)"
log "═══ done ═══"
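The wake-up windows in `schedule_report` depend on one small calculation: seconds from now until the next occurrence of a given local hour. A minimal sketch follows, assuming a fixed UTC+7 offset stands in for `Asia/Bangkok` (Thailand observes no daylight saving time); the function name `seconds_until` and the `BANGKOK` constant are hypothetical.

```python
# Hypothetical sketch: seconds until the next HH:00 in a fixed-offset zone.
import datetime as dt

# Assumption: a fixed UTC+7 offset is equivalent to Asia/Bangkok here.
BANGKOK = dt.timezone(dt.timedelta(hours=7))


def seconds_until(hour, now):
    """Seconds from `now` to the next time the clock reads `hour`:00:00."""
    target = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if target <= now:
        # Today's window has already passed, so aim for tomorrow's.
        target += dt.timedelta(days=1)
    return int((target - now).total_seconds())


now = dt.datetime(2025, 1, 1, 6, 30, tzinfo=BANGKOK)
for hour in (7, 11, 17):
    print(hour, seconds_until(hour, now))
```

Rolling over to the next day when the target has passed means a pipeline started at 08:00 still schedules the 07:00 window, just for the following morning.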
@@ -0,0 +1,471 @@
#!/usr/bin/env bash
# Surrogate-1: overnight evaluation report generator.
#
# Reads all evidence (bench summary, outcomes log, queue, autonomous-release
# specs, daemon logs) and computes the 4-axis weighted score from
# ~/.surrogate/state/specs/overnight-rubric.md.
#
# Output: a single dense markdown report at
#   ~/.surrogate/state/overnight-report-YYYY-MM-DD.md
#   + Discord ping with TLDR + verdict.
#
# Run on schedule (cron at 06:00 Bangkok = 23:00 UTC) or manually:
#   bash bin/v2/overnight-report.sh
set -uo pipefail
[[ -f "$HOME/.hermes/.env" ]] && { set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a; }

STATE="$HOME/.surrogate/state"
LOGS="$HOME/.surrogate/logs"
EVAL="$HOME/.surrogate/eval"
TODAY="$(date +%Y-%m-%d)"
REPORT="$STATE/overnight-report-$TODAY.md"
mkdir -p "$STATE"

notify() {
  [[ -z "${DISCORD_WEBHOOK:-}" ]] && return
  local content="$1"
  curl -s -X POST -H "Content-Type: application/json" \
    -d "$(python3 -c "import json,sys; print(json.dumps({'content': sys.argv[1]}))" "$content")" \
    "$DISCORD_WEBHOOK" >/dev/null 2>&1 || true
}

# ── Compute everything in one Python pass for consistency ────────────────────
python3 - "$REPORT" <<'PYEOF'
import json, os, re, sys, glob, datetime as dt
from pathlib import Path
from collections import Counter

REPORT = sys.argv[1]
HOME = Path.home()
STATE = HOME / ".surrogate/state"
EVAL = HOME / ".surrogate/eval"
LOGS = HOME / ".surrogate/logs"

now = dt.datetime.now(dt.timezone.utc)
def read_lines(p):
    """Return the file's lines, or [] if it is missing or unreadable."""
    try: return Path(p).read_text().splitlines()
    except Exception: return []

def jsonl(p):
    """Parse a JSONL file, silently skipping lines that are not valid JSON."""
    rows = []
    for L in read_lines(p):
        try: rows.append(json.loads(L))
        except Exception: pass
    return rows

# ── Evidence: bench summary (4-way × 9 evals) ────────────────────────────────
def latest_bench():
    candidates = sorted(EVAL.glob("bench-v1-vs-v15-*/summary.json"),
                        key=lambda p: p.stat().st_mtime, reverse=True)
    return candidates[0] if candidates else None

bench_path = latest_bench()
bench = json.loads(bench_path.read_text()) if bench_path and bench_path.exists() else {}

def score_metric(label, key, target):
    """Compare V8 (v1.2-research) vs v1 baseline."""
    try:
        v8 = float(bench.get("v1.2-research", {}).get(key, "0") or 0)
        v1 = float(bench.get("v1", {}).get(key, "0") or 0)
        delta = v8 - v1
        win = delta > 0.5
        return v8, v1, delta, win
    except Exception:
        return None, None, None, False

# ── Evidence: outcomes.jsonl (SRE + release daemon traces) ───────────────────
outcomes = jsonl(STATE / "outcomes.jsonl")
sre_rows = [r for r in outcomes if r.get("daemon") == "sre"]
release_rows = [r for r in outcomes if r.get("daemon") == "release"]

# ── Evidence: queue + draft PRs ──────────────────────────────────────────────
queue_dirs = list((STATE / "queue").glob("*")) if (STATE / "queue").exists() else []
draft_pr_count = 0
release_cycles = list(STATE.glob("release-*"))
for rc in release_cycles:
    log = rc / "winner.json"
    if log.exists():
        try:
            j = json.loads(log.read_text())
            if j.get("target_file"):
                draft_pr_count += 1
        except Exception: pass

# ── Self-improve flywheel state ──────────────────────────────────────────────
si_dir = STATE / "self-improve"
sft_n = len(read_lines(si_dir / "sft.jsonl")) if si_dir.exists() else 0
kto_n = len(read_lines(si_dir / "kto.jsonl")) if si_dir.exists() else 0
skills_n = len(read_lines(si_dir / "skills.jsonl")) if si_dir.exists() else 0

# ── Watchdog kills ───────────────────────────────────────────────────────────
killed_marker = STATE / "watchdog-killed"
watchdog_killed = killed_marker.exists()

# ── Idempotency state (replay storms) ────────────────────────────────────────
idem_rows = jsonl(STATE / "idempotency.jsonl")
idem_keys = Counter(r.get("key") for r in idem_rows)
replay_storms = sum(1 for _, c in idem_keys.items() if c >= 2)

# ── Axis 1: Coding Capability (40%) ──────────────────────────────────────────
def score_coding():
    parts = {}
    if bench:
        wins = 0
        for key in ("humaneval_plus", "mbpp_plus", "lcb_v6", "bfcl_v3_overall",
                    "ruler_16k_avg", "swebench_verified_lite100",
                    "axentx_eval_50", "multi_iac_eval", "itbench_lite"):
            _, _, _, w = score_metric(key, key, "")
            if w: wins += 1
        parts["wins_pts"] = (10 if wins >= 7 else 7 if wins >= 5
                             else 4 if wins >= 3 else 1)
        parts["wins_count"] = wins

        bfcl = float(bench.get("v1.2-research", {}).get("bfcl_v3_overall", "0") or 0)
        parts["bfcl_score"] = bfcl
        parts["bfcl_pts"] = (10 if bfcl >= 70 else 7 if bfcl >= 60
                             else 4 if bfcl >= 50 else 1)

        miac = float(bench.get("v1.2-research", {}).get("multi_iac_eval", "0") or 0)
        itb = float(bench.get("v1.2-research", {}).get("itbench_lite", "0") or 0)
        if miac >= 55 and itb >= 55: parts["domain_pts"] = 10
        elif miac >= 55 or itb >= 55: parts["domain_pts"] = 6
        elif miac >= 40 or itb >= 40: parts["domain_pts"] = 3
        else: parts["domain_pts"] = 1
        parts["multi_iac"] = miac
        parts["itbench"] = itb
    else:
        parts["wins_pts"] = parts["bfcl_pts"] = parts["domain_pts"] = 0
        parts["wins_count"] = parts["bfcl_score"] = parts["multi_iac"] = parts["itbench"] = 0

    if draft_pr_count >= 10: parts["pr_pts"] = 10
    elif draft_pr_count >= 5: parts["pr_pts"] = 7
    elif draft_pr_count >= 1: parts["pr_pts"] = 4
    else: parts["pr_pts"] = 0
    parts["pr_count"] = draft_pr_count

    parts["sub_total"] = parts["wins_pts"] + parts["bfcl_pts"] + parts["domain_pts"] + parts["pr_pts"]
    parts["axis_score"] = parts["sub_total"] * 4 / 4   # ×4 / 4 = same; but kept for symmetry
    parts["axis_max"] = 40
    return parts

# ── Axis 2: Incident Response (25%) ──────────────────────────────────────────
def score_incident():
    parts = {}
    triggered = [r for r in sre_rows if r.get("response")]
    parts["triggered_n"] = len(triggered)
    if triggered:
        with_fix = [r for r in triggered
                    if isinstance(r.get("response"), dict)
                    and r["response"].get("fix_kind", "none") != "none"
                    and float(r["response"].get("confidence", 0)) >= 0.7]
        diag_pct = 100 * len(with_fix) / max(1, len(triggered))
    else:
        with_fix, diag_pct = [], 0
    parts["diag_pct"] = diag_pct
    parts["diag_pts"] = (10 if diag_pct >= 80 else 7 if diag_pct >= 50
                         else 4 if diag_pct >= 20 else 0)

    verified = [r for r in sre_rows if r.get("verdict")]
    if verified:
        passed = [r for r in verified
                  if isinstance(r.get("verdict"), dict) and r["verdict"].get("ok")]
        ver_pct = 100 * len(passed) / max(1, len(verified))
    else:
        passed, ver_pct = [], 0
    parts["ver_pct"] = ver_pct
    parts["ver_pts"] = (10 if ver_pct >= 40 else 7 if ver_pct >= 25
                        else 4 if ver_pct >= 10 else 0)

    applied = [r for r in sre_rows if r.get("applied")]
    if applied:
        succ = [r for r in applied if r.get("outcome") == "success"]
        apply_pct = 100 * len(succ) / max(1, len(applied))
    else:
        succ, apply_pct = [], 0
    parts["apply_pct"] = apply_pct
    parts["apply_pts"] = (10 if apply_pct >= 80 else 7 if apply_pct >= 50
                          else 4 if apply_pct >= 30 else 0)

    # Hard refusals: any HardGuard hits in queued items?
    hard_violations = 0
    for q in queue_dirs:
        verdict_path = q / "verdict.json"
        if verdict_path.exists():
            try:
                v = json.loads(verdict_path.read_text())
                for c in v.get("checks", []):
                    if c.get("name") == "policy" and c.get("status") == "FAIL":
                        if "destructive" in c.get("detail", "").lower() \
                           or "rm -rf" in c.get("detail", "") \
                           or "destroy" in c.get("detail", "").lower():
                            hard_violations += 1
                            break
            except Exception: pass
    parts["hard_violations"] = hard_violations
    parts["safety_pts"] = 10 if hard_violations == 0 else 0

    parts["sub_total"] = parts["diag_pts"] + parts["ver_pts"] + parts["apply_pts"] + parts["safety_pts"]
    parts["axis_score"] = parts["sub_total"] * 2.5 / 4
    parts["axis_max"] = 25
    return parts

# ── Axis 3: Monitoring Reliability (15%) ─────────────────────────────────────
def score_monitoring():
    parts = {}
    sre_log = LOGS / "autonomous-sre.log"
    if sre_log.exists():
        # Count sweep banners; accept either box-drawing rule glyph.
        sweeps = len(re.findall(r"[═─]{3} SRE sweep [═─]{3}", sre_log.read_text()))
        # Expected: every 5 min for ~12 hr = ~144 sweeps. Full marks at
        # ≥100 sweeps (about 70% of the expected cadence).
        if sweeps >= 100: parts["sweep_pts"] = 10
        elif sweeps >= 60: parts["sweep_pts"] = 7
        elif sweeps >= 20: parts["sweep_pts"] = 4
        else: parts["sweep_pts"] = 0 if sweeps == 0 else 2
        parts["sweep_n"] = sweeps
    else:
        parts["sweep_pts"] = 0
        parts["sweep_n"] = 0

    probe_kinds = Counter(r.get("trigger", "").split(":")[0] for r in sre_rows)
    # Probe kinds the SRE daemon is expected to emit (listed for reference;
    # the score below counts every distinct kind seen, not only these).
    expected_probes = {"hf_space_running", "hf_space_runtime_error", "hf_space_build_error",
                       "hf_dataset_stale", "zerogpu_smoke_fail", "gh_workflow_repeated_failure"}
    seen_kinds = sum(1 for k in probe_kinds if k)
    if seen_kinds >= 4: parts["probe_pts"] = 10
    elif seen_kinds >= 2: parts["probe_pts"] = 6
    elif seen_kinds >= 1: parts["probe_pts"] = 3
    else: parts["probe_pts"] = 0
    parts["probe_kinds_seen"] = seen_kinds

    if watchdog_killed: parts["watchdog_pts"] = 0
    else: parts["watchdog_pts"] = 10
    parts["watchdog_killed"] = watchdog_killed

    # Log integrity: every line of outcomes.jsonl must parse as JSON.
    log_int_ok = True
    try:
        for L in (STATE / "outcomes.jsonl").read_text().splitlines():
            json.loads(L)
    except FileNotFoundError:
        log_int_ok = True   # nothing logged yet, trivially clean
    except Exception:
        log_int_ok = False
    parts["log_pts"] = 10 if log_int_ok else 0

    parts["sub_total"] = parts["sweep_pts"] + parts["probe_pts"] + parts["watchdog_pts"] + parts["log_pts"]
    parts["axis_score"] = parts["sub_total"] * 1.5 / 4
    parts["axis_max"] = 15
    return parts

# ── Axis 4: Agentic Breadth (20%) ────────────────────────────────────────────
def score_breadth():
    parts = {}
    triggers = Counter(r.get("trigger", "?") for r in outcomes)
    distinct = len(triggers)
    if distinct >= 5: parts["trig_pts"] = 10
    elif distinct >= 3: parts["trig_pts"] = 6
    elif distinct >= 1: parts["trig_pts"] = 3
    else: parts["trig_pts"] = 0
    parts["distinct_triggers"] = distinct

    fix_kinds = Counter()
    for r in outcomes:
        resp = r.get("response")
        if isinstance(resp, dict):
            fk = resp.get("fix_kind") or resp.get("kind")
            if fk: fix_kinds[fk] += 1
    distinct_kinds = len(fix_kinds)
    if distinct_kinds >= 3: parts["kind_pts"] = 10
    elif distinct_kinds >= 2: parts["kind_pts"] = 6
    elif distinct_kinds >= 1: parts["kind_pts"] = 3
    else: parts["kind_pts"] = 0
    parts["fix_kinds"] = dict(fix_kinds)

    if sft_n >= 50 and kto_n >= 100 and skills_n >= 3:
        parts["si_pts"] = 10
    elif sft_n >= 20 or kto_n >= 50:
        parts["si_pts"] = 5
    else:
        parts["si_pts"] = 0
    parts["sft_n"] = sft_n; parts["kto_n"] = kto_n; parts["skills_n"] = skills_n

    parts["replay_storms"] = replay_storms
    parts["idem_pts"] = 10 if replay_storms == 0 else max(0, 10 - 3*replay_storms)

    parts["sub_total"] = parts["trig_pts"] + parts["kind_pts"] + parts["si_pts"] + parts["idem_pts"]
    parts["axis_score"] = parts["sub_total"] * 2 / 4
    parts["axis_max"] = 20
    return parts

| 297 |
+
# ββ Compute total + verdict βββββββββββββββββββββββββββββββββββββββββββββββββ
a1 = score_coding()
a2 = score_incident()
a3 = score_monitoring()
a4 = score_breadth()

total = round(a1["axis_score"] + a2["axis_score"] + a3["axis_score"] + a4["axis_score"], 1)

if total >= 80:
    verdict = "🟢 'agentic, the real deal' - works as advertised"
    next_action = "✅ proceed full deployment, fire Civo v2 32B magnificent"
    badge = "AGENTIC"
elif total >= 60:
    verdict = "🟡 'works in scope, needs targeted fixes'"
    next_action = "🔧 identify lowest sub-score, run focused training round (Branch B)"
    badge = "PARTIAL"
elif total >= 40:
    verdict = "🟠 'promise but flaky'"
    next_action = "⚙️ Branch B with extended techniques OR data cleanup"
    badge = "FLAKY"
else:
    verdict = "🔴 'not autonomous yet'"
    next_action = "❌ Branch C - quality-cleanup.sh first, then re-train"
    badge = "NOT_READY"

# ── Render report ───────────────────────────────────────────────────────────
def fmt_score(p, _max):
    return f"{p}/{_max}"

bench_table_rows = []
if bench:
    for label in ("v1", "base7B", "v1.1-extended", "v1.2-research"):
        m = bench.get(label, {})
        bench_table_rows.append((label,
            m.get("humaneval_plus", "-"),
            m.get("mbpp_plus", "-"),
            m.get("lcb_v6", "-"),
            m.get("bfcl_v3_overall", "-"),
            m.get("ruler_16k_avg", "-"),
            m.get("swebench_verified_lite100", "-"),
            m.get("axentx_eval_50", "-"),
            m.get("multi_iac_eval", "-"),
            m.get("itbench_lite", "-"),
        ))

md = f"""# Surrogate-1 V8 - Overnight Report

**Generated**: {now.strftime('%Y-%m-%d %H:%M UTC')}
**Subject**: `axentx/surrogate-1-7B-v1.2-research` on `surrogate1/surrogate-1-zero-gpu`
**Total score**: **{total}/100** - {verdict}
**Next action**: {next_action}

---

## Axis breakdown

| Axis | Score | Max |
|---|---|---|
| 1. Coding Capability | **{round(a1['axis_score'],1)}** | 40 |
| 2. Incident Response | **{round(a2['axis_score'],1)}** | 25 |
| 3. Monitoring Reliability | **{round(a3['axis_score'],1)}** | 15 |
| 4. Agentic Breadth | **{round(a4['axis_score'],1)}** | 20 |
| **TOTAL** | **{total}** | **100** |

---

## Axis 1 - Coding Capability ({round(a1['axis_score'],1)}/40)

| Sub | Score | Detail |
|---|---|---|
| Bench wins (v1.2 vs v1) | {a1['wins_pts']}/10 | {a1['wins_count']}/9 metrics improved |
| BFCL v3 (function calling) | {a1['bfcl_pts']}/10 | score = {a1['bfcl_score']:.1f} |
| In-domain (Multi-IaC + ITBench) | {a1['domain_pts']}/10 | Multi-IaC={a1['multi_iac']:.1f}, ITBench={a1['itbench']:.1f} |
| Arkship Phase B PRs | {a1['pr_pts']}/10 | {a1['pr_count']} draft PRs opened |

### Bench scores (4-way)

| model | HE+ | MBPP+ | LCB v6 | BFCL | RULER 16K | SWE-B | axentx-50 | Multi-IaC | ITBench |
|---|---|---|---|---|---|---|---|---|---|
"""
for r in bench_table_rows:
    md += "| " + " | ".join(str(x) for x in r) + " |\n"
if not bench_table_rows:
    md += "| _bench has not run yet - wait for `auto-swap-and-bench` daemon to fire_ ||||||||||\n"

md += f"""

---

## Axis 2 - Incident Response ({round(a2['axis_score'],1)}/25)

| Sub | Score | Detail |
|---|---|---|
| Diagnosis depth (fix_kind≠none, conf≥0.7) | {a2['diag_pts']}/10 | {a2['diag_pct']:.0f}% on {a2['triggered_n']} anomalies |
| Verifier acceptance | {a2['ver_pts']}/10 | {a2['ver_pct']:.0f}% passed 9-layer gate |
| Apply success (no rollback) | {a2['apply_pts']}/10 | {a2['apply_pct']:.0f}% |
| Hard-refuse violations queued | {a2['safety_pts']}/10 | {a2['hard_violations']} destructive patterns reached queue |

---

## Axis 3 - Monitoring Reliability ({round(a3['axis_score'],1)}/15)

| Sub | Score | Detail |
|---|---|---|
| Sweep cadence | {a3['sweep_pts']}/10 | {a3['sweep_n']} sweeps recorded |
| Probe coverage | {a3['probe_pts']}/10 | {a3['probe_kinds_seen']} distinct probe kinds saw anomalies |
| Watchdog clean | {a3['watchdog_pts']}/10 | killed = {a3['watchdog_killed']} |
| Outcome log integrity | {a3['log_pts']}/10 | parses cleanly = {a3['log_pts']==10} |

---

## Axis 4 - Agentic Breadth ({round(a4['axis_score'],1)}/20)

| Sub | Score | Detail |
|---|---|---|
| Distinct triggers handled | {a4['trig_pts']}/10 | {a4['distinct_triggers']} types |
| Fix-kind diversity | {a4['kind_pts']}/10 | kinds = {a4['fix_kinds']} |
| Self-improve flywheel | {a4['si_pts']}/10 | SFT={a4['sft_n']}, KTO={a4['kto_n']}, skills={a4['skills_n']} |
| Idempotency working | {a4['idem_pts']}/10 | replay storms = {a4['replay_storms']} |

---

## Owner action menu

```bash
# 1. Pause everything (always safe, reversible)
pkill -f autonomous- ; pkill -f watchdog.sh

# 2. Inspect verifier-rejected actions awaiting your call
ls -la ~/.surrogate/state/queue/

# 3. See V8 vs v1 head-to-head deltas
cat {bench_path if bench_path else "(no-bench-yet)"}

# 4. Read full draft PRs queue (autonomous-release output)
gh pr list --label autonomous-release   # if GitHub reachable
ls ~/.surrogate/state/release-*/
```

---

## What this scorecard means

**{badge}** = {verdict}

Cut-offs (from `~/.surrogate/state/specs/overnight-rubric.md`):
- 80–100 → 🟢 agentic (the real deal), fire Civo v2
- 60–79 → 🟡 partial, targeted fix
- 40–59 → 🟠 flaky, Branch B retrain
- 0–39 → 🔴 not yet, Branch C cleanup

---
*Generated by `~/.surrogate/hf-space/bin/v2/overnight-report.sh`*
"""

Path(REPORT).write_text(md)
print(f"REPORT_PATH={REPORT}")
print(f"TOTAL_SCORE={total}")
print(f"VERDICT={verdict}")
print(f"BADGE={badge}")
PYEOF

# Extract score + verdict from report (Python already wrote it)
TOTAL=$(grep -oE "Total score.*\*\*[0-9.]+/100\*\*" "$REPORT" | head -1 | grep -oE "[0-9.]+" | head -1)
VERDICT_LINE=$(grep -oE "🟢|🟡|🟠|🔴" "$REPORT" | head -1)

# Discord ping
notify "🌅 Overnight Report ready (score=${TOTAL:-?}/100 ${VERDICT_LINE:-}) - see $REPORT"

echo ""
echo "═════════════════════════════════════════════════════════════════════"
echo "  OVERNIGHT REPORT COMPLETE"
echo "  $REPORT"
echo "  Total: ${TOTAL:-?}/100 ${VERDICT_LINE:-}"
echo "═════════════════════════════════════════════════════════════════════"
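The scoring arithmetic in the script follows one pattern per axis: four sub-scores worth 0 to 10 points each are summed (0 to 40) and rescaled to the axis weight, so `sub_total * 1.5 / 4` caps Monitoring at 15 and `sub_total * 2 / 4` caps Agentic Breadth at 20. The four axis scores then add up to a total out of 100, which is bucketed at 80, 60 and 40. A minimal Python sketch of that pattern is shown here. The helper names `scale_axis`, `idempotency_points` and `badge_for` are hypothetical and do not appear in `overnight-report.sh`; only the constants are taken from the script.

```python
def scale_axis(sub_scores, axis_max):
    """Rescale four 0-10 sub-scores (0-40 in total) to an axis worth axis_max points."""
    if len(sub_scores) != 4:
        raise ValueError("each axis is assumed to have exactly four sub-scores")
    return sum(sub_scores) * axis_max / 40


def idempotency_points(replay_storms):
    """10 points with no replay storms, minus 3 per storm, floored at 0."""
    return max(0, 10 - 3 * replay_storms)


def badge_for(total):
    """Map a 0-100 total onto the four badges used by the report."""
    if total >= 80:
        return "AGENTIC"
    if total >= 60:
        return "PARTIAL"
    if total >= 40:
        return "FLAKY"
    return "NOT_READY"


if __name__ == "__main__":
    # Illustrative sub-scores only; they are not real results.
    monitoring = scale_axis([10, 6, 10, 10], 15)                # same as sub_total * 1.5 / 4
    breadth = scale_axis([6, 3, 5, idempotency_points(2)], 20)  # same as sub_total * 2 / 4
    print(monitoring, breadth, badge_for(monitoring + breadth + 45))
```

Keeping every sub-score on the same 0 to 10 scale means an axis weight can be changed by editing one multiplier, without touching the sub-score thresholds.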
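The Python block prints `REPORT_PATH=`, `TOTAL_SCORE=`, `VERDICT=` and `BADGE=` lines on stdout, while the shell wrapper re-derives the score by grepping the rendered report. As an illustrative alternative, and not what the script does, those lines could be parsed directly by a caller. `parse_kv` is a hypothetical helper, and the sample text is made up for the example.

```python
def parse_kv(stdout_text):
    """Parse KEY=VALUE lines into a dict, splitting on the first '=' only."""
    result = {}
    for line in stdout_text.splitlines():
        key, sep, value = line.partition("=")
        # Skip blank lines and lines without a key or an '=' separator.
        if sep and key.strip():
            result[key.strip()] = value.strip()
    return result


if __name__ == "__main__":
    # Made-up sample in the KEY=VALUE format printed by the report generator.
    sample = (
        "REPORT_PATH=/tmp/overnight-report-example.md\n"
        "TOTAL_SCORE=67.5\n"
        "VERDICT='works in scope, needs targeted fixes'\n"
        "BADGE=PARTIAL\n"
    )
    fields = parse_kv(sample)
    print(float(fields["TOTAL_SCORE"]), fields["BADGE"])
```

Splitting on the first `=` only keeps values that themselves contain `=` intact, which matters if a verdict string ever includes one.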