Ashira Pitchayapakayakul committed on
Commit
4e166c6
·
1 Parent(s): dddf626

feat(v8+autonomy): research-driven trainer + 4 daemons + 9-layer safety gate


Synthesised from 4 parallel research streams (~1.6k lines of dense notes
in knowledge/trends-2026/) into one shippable change.

V8 trainer (kaggle-trainer.sh) — 5 research-grounded additions (knob
examples below):
• PiSSA SVD init (replaces LoftQ default; LoftQ kept as fallback via
  SUR_LORA_INIT=loftq) — Meng '24, +1-3pp on code benchmarks.
• LoRA+ optimizer with lr_B = 16·lr_A — Hayou '24, free +1-2pp,
  via peft.optimizers.create_loraplus_optimizer with manual-split
  fallback for older peft.
• V8 dataset blend via merge_external() — ToolACE 1.5×, Multi-IaC-Eval
  2×, xLAM-fn-call-60k 1×, ITBench-Trajectories 2×, Code-Feedback 1×.
  Each take/weight env-tunable; format-tolerant via extract_pair().
• GRPO Phase-2 scaffold (RUN_GRPO=1) — DeepSeekMath/RLVR-Code, post-SFT
  booster with execution-pass reward function. Disabled by default
  (needs TRL ≥0.12 + ≥30GB VRAM headroom).
• Hub bumped: axentx/surrogate-1-7B-v1.2-research.
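
Knob examples (illustrative values, not new defaults; assumes the knobs
are exported in the environment the trainer's embedded script reads):

    # A/B against the old init and skip one blend source for a run
    SUR_LORA_INIT=loftq TAKE_MULTIIAC=0 bash bin/kaggle-trainer.sh

    # opt in to the Phase-2 booster (TRL ≥0.12 + ≥30GB VRAM headroom)
    RUN_GRPO=1 SUR_LORA_PLUS_RATIO=16 bash bin/kaggle-trainer.sh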

Autonomous daemons (4) — all share verifier-ensemble.py + outcome-log.py
(sample outcome record below):
• autonomous-sre.sh — 5-min sweep: HF Space stages, dataset staleness,
  ZeroGPU smoke, GH Action failure rate, outcome-log self-health. On
  anomaly: build prompt → call Surrogate → idempotency dedupe → 9-layer
  verifier → apply OR queue. Whitelisted scope: only systems Surrogate
  owns (no AWS/prod).
• autonomous-release.sh — hourly recon: HN + GH-trending + ProductHunt;
  cluster by owner-relevant keywords, build SDD spec, generate 3 patch
  candidates with CISC self-consistency voting (research §autonomous-24x7
  pattern 1), pick best by verifier+confidence, open draft PR via gh.
• self-improve.sh — daily/weekly flywheel: outcomes.jsonl → SFT replay
  (success-only, RLEF-aligned), KTO unpaired (every label, lossless
  on logs), skill library (verified procedures by trigger). Pushes to
  axentx/surrogate-1-{self-traces,pref-kto,skills}; flags next training
  when SFT≥200 or KTO≥500.
• watchdog.sh — independent observer with kill-switch. Detects loops
  (≥5 same trigger in 15m), failure cascades (≥5 consecutive non-successes),
  rate spikes (≥30/min), audit gaps (applied without verdict), disk
  fill. Never calls Surrogate, never applies; only kills + records.

Safety gate (verifier-ensemble.py) — single source of truth, 9 layers
(invocation sketch below):
ast / lint (ruff/shellcheck/cfn-lint/tflint) / typecheck / tests /
policy (14-rule HardGuard list — terraform destroy, kubectl delete ns
prod-*, IAM Allow*:*, ec2 terminate w/o dry-run, rds delete w/o final
snapshot, helm install w/o digest pin, AKIA/private-key/sk-/hf_ leaks,
MFA bypass, force-push to main, etc) / security (gitleaks+semgrep+
cfn-guard) / diff sanity (≤300 lines, ≤8 files) / sandbox (docker
--network=none --read-only --cap-drop=ALL) / confidence (≥0.95 floor
for destructive-class actions). All non-SKIP layers must PASS, and ≥3
verifiers must run.
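
Invocation sketch, as the daemons drive it (flags per autonomous-sre.sh;
the verdict fields listed are the ones the callers read back):

    python3 bin/v2/verifier-ensemble.py \
      --change work/patch.txt --target path/to/target.py \
      --kind code --confidence 0.97 --out work/verdict.json
    # verdict.json carries at least {"ok": bool, "n_pass": int, "reasons": [...]}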

Helpers: surrogate-call.py (strict-JSON LLM call w/ retries + schema
validation for diagnosis|spec|patch), outcome-log.py (append-only JSONL),
idempotency.py (sha256(plan) ledger w/ TTL — prevents replay storms when
the same anomaly fires twice; usage below).
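
Idempotency usage, straight from the tool's docstring (check exits 0 when
the key was seen within the TTL, i.e. skip; 1 when new, i.e. proceed):

    python3 bin/v2/idempotency.py check --plan work/patch.txt --ttl-hours 4
    # ...apply the verified fix, then:
    python3 bin/v2/idempotency.py record --plan work/patch.txt --daemon sre --outcome applied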

Bench (bench-v1-vs-v15.sh): added 4th model (v1.2-research) + 2 new evals
(Multi-IaC-Eval CFN/TF/CDK pass-rate, ITBench-lite K8s SRE scenarios).
Now 4-way × 9 evals; summary JSON shape below.
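
The per-model block in the summary JSON gains the two new keys (scores
elided here):

    "v1.2-research": {
      ...
      "axentx_eval_50": "...",
      "multi_iac_eval": "...",
      "itbench_lite": "..."
    }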

Architecture map: knowledge/surrogate-1-autonomous-arch.md — single
on-ramp doc with all components, file paths, run/disarm commands, the
14 HardGuards, and the V9 stretch ladder.

V7 train.py at ~/Desktop/surrogate-1-train-v7-7B-extended-plus.py is now
superseded by ~/Desktop/surrogate-1-train-v8-research.py. User uploads
the V8 file via Kaggle UI Replace File → Save Version when ready.

bin/kaggle-trainer.sh CHANGED
@@ -206,8 +206,8 @@ EPOCHS = float(os.environ.get("EPOCHS", "1"))
206
  _default_hub = {
207
  32.0: "axentx/surrogate-1-coder-32B-v1.5",
208
  14.0: "axentx/surrogate-1-coder-14B-v1.5-mid",
209
- 7.0: "axentx/surrogate-1-7B-v1.1-extended", # ← T4×2 validation target
210
- }.get(_auto_size, "axentx/surrogate-1-7B-v1.1-extended")
211
  HUB_ID = os.environ.get("HUB_MODEL_ID", _default_hub)
212
  # seq_len auto-shrinks for smaller hardware budget
213
  _default_seq = {32.0: 2048, 14.0: 4096, 7.0: 8192}.get(_auto_size, 2048)
@@ -339,6 +339,44 @@ try:
339
  except Exception as e:
340
  print(f" ✗ Magpie skip (repo not yet published): {type(e).__name__}: {str(e)[:80]}")
341

342
  raw = Dataset.from_list(rows)
343
  # (Active-learning teachable filter applied AFTER model load — see below.
344
  # Filtering needs the 4-bit base model to score perplexity, which doesn't
@@ -445,21 +483,28 @@ lora_kwargs = dict(
445
  use_dora=True, # R2: DoRA
446
  task_type="CAUSAL_LM",
447
  )
448
- # RSLoRA + LoftQ require recent peft versions — fall back gracefully
449
  try:
450
  from peft import LoraConfig as _Probe
451
  import inspect
452
  _sig = inspect.signature(_Probe).parameters
453
  if "use_rslora" in _sig: lora_kwargs["use_rslora"] = True
454
  if "init_lora_weights" in _sig:
455
- try:
456
- from peft import LoftQConfig
457
- lora_kwargs["init_lora_weights"] = "loftq"
458
- lora_kwargs["loftq_config"] = LoftQConfig(loftq_bits=4, loftq_iter=5)
459
- except Exception:
460
- pass
461
- except Exception:
462
- pass
463
  print(f" LoRA config: r={LORA_R}, DoRA={lora_kwargs.get('use_dora')}, "
464
  f"RSLoRA={lora_kwargs.get('use_rslora', False)}, "
465
  f"init={lora_kwargs.get('init_lora_weights', 'gaussian')}, "
@@ -469,6 +514,44 @@ lora = LoraConfig(**lora_kwargs)
469
  model = get_peft_model(model, lora)
470
  model.print_trainable_parameters()
471

472
  # ── Format chat template (system + user + assistant) ────────────────────────
473
  def fmt(ex):
474
  msgs = [
@@ -522,12 +605,17 @@ sft_cfg = SFTConfig(
522
  report_to="none",
523
  )
524
 
525
- trainer = SFTTrainer(
526
  model=model,
527
  args=sft_cfg,
528
  train_dataset=raw,
529
  tokenizer=tok,
530
  )
531
 
532
  print()
533
  print("━━━ training start ━━━")
@@ -536,10 +624,65 @@ print("━━━ training done ━━━")
536
 
537
  # Final push (in case last save_steps didn't trigger)
538
  trainer.push_to_hub(commit_message=(
539
- f"Surrogate-1 v1.5 SFT — base={BASE.split('/')[-1]}, "
540
- f"r=32+DoRA, NEFTune α=5, seq={SEQ_LEN}, "
541
  f"{len(rows):,} samples Γ— {EPOCHS} epochs (Kaggle T4Γ—2)"))
542
  print("✅ pushed to", HUB_ID)
543
  PYEOF
544
 
545
  # ── Push notebook to Kaggle (creates if not exists, updates if exists) ─────
 
206
  _default_hub = {
207
  32.0: "axentx/surrogate-1-coder-32B-v1.5",
208
  14.0: "axentx/surrogate-1-coder-14B-v1.5-mid",
209
+ 7.0: "axentx/surrogate-1-7B-v1.2-research", # ← V8: research-driven stack
210
+ }.get(_auto_size, "axentx/surrogate-1-7B-v1.2-research")
211
  HUB_ID = os.environ.get("HUB_MODEL_ID", _default_hub)
212
  # seq_len auto-shrinks for smaller hardware budget
213
  _default_seq = {32.0: 2048, 14.0: 4096, 7.0: 8192}.get(_auto_size, 2048)
 
339
  except Exception as e:
340
  print(f" ✗ Magpie skip (repo not yet published): {type(e).__name__}: {str(e)[:80]}")
341
 
342
+ # ── V8 RESEARCH-DRIVEN DATASET BLEND ────────────────────────────────────────
343
+ # From research §devsecops-sre-agentic.md (top-5 datasets) + §coding-llm-frontier
344
+ # (#5 Code-Feedback). Each blend is opt-in via env knob (default ON).
345
+ # Format-tolerant extract_pair() handles ShareGPT, instruction/output, etc.
346
+ def merge_external(repo: str, take: int, weight: float, name: str):
347
+ """Stream-and-merge a HF dataset with weight oversampling."""
348
+ if take <= 0:
349
+ print(f" - {name}: disabled (take=0)")
350
+ return 0
351
+ try:
352
+ # Many of these datasets are gated; use HF_TOKEN automatically
353
+ ds = load_dataset(repo, split="train", streaming=True)
354
+ n = 0
355
+ replicate = max(1, int(round(weight)))
356
+ for ex in ds:
357
+ if n >= take: break
358
+ pair = extract_pair(ex)
359
+ if not pair: continue
360
+ p, r = pair
361
+ for _ in range(replicate):
362
+ rows.append({"prompt": p, "response": r})
363
+ n += 1
364
+ print(f" + {name}: {n:,} pairs × {replicate} = {n*replicate:,} rows merged")
365
+ return n
366
+ except Exception as e:
367
+ msg = f"{type(e).__name__}: {str(e)[:90]}"
368
+ print(f" ✗ {name} skip ({repo}): {msg}")
369
+ return 0
370
+
371
+ # Research-recommended weights — see knowledge/trends-2026/devsecops-sre-agentic.md
372
+ merge_external("Team-ACE/ToolACE", int(os.environ.get("TAKE_TOOLACE", "8000")), 1.5, "ToolACE")
373
+ merge_external("AmazonScience/Multi-IaC-Eval", int(os.environ.get("TAKE_MULTIIAC", "5000")), 2.0, "Multi-IaC-Eval")
374
+ merge_external("Salesforce/xlam-function-calling-60k", int(os.environ.get("TAKE_XLAM", "10000")), 1.0, "xLAM-fn-call-60k")
375
+ merge_external("ibm-research/ITBench-Trajectories", int(os.environ.get("TAKE_ITBENCH", "3000")), 2.0, "ITBench-Trajectories")
376
+ merge_external("m-a-p/Code-Feedback", int(os.environ.get("TAKE_CODEFB", "8000")), 1.0, "Code-Feedback")
377
+
378
+ print(f" total rows after V8 blend: {len(rows):,}")
379
+
380
  raw = Dataset.from_list(rows)
381
  # (Active-learning teachable filter applied AFTER model load β€” see below.
382
  # Filtering needs the 4-bit base model to score perplexity, which doesn't
 
483
  use_dora=True, # R2: DoRA
484
  task_type="CAUSAL_LM",
485
  )
486
+ # V8: PiSSA init by default (research §coding-llm-frontier #4) — SVD of base
487
+ # weights gives a much better starting point than gaussian. LoftQ/gaussian
488
+ # remain as env-controlled fallback for A/B comparison.
489
+ LORA_INIT = os.environ.get("SUR_LORA_INIT", "pissa_niter_4")
490
  try:
491
  from peft import LoraConfig as _Probe
492
  import inspect
493
  _sig = inspect.signature(_Probe).parameters
494
  if "use_rslora" in _sig: lora_kwargs["use_rslora"] = True
495
  if "init_lora_weights" in _sig:
496
+ if LORA_INIT.startswith("pissa"):
497
+ lora_kwargs["init_lora_weights"] = LORA_INIT # "pissa" or "pissa_niter_K"
498
+ elif LORA_INIT == "loftq":
499
+ try:
500
+ from peft import LoftQConfig
501
+ lora_kwargs["init_lora_weights"] = "loftq"
502
+ lora_kwargs["loftq_config"] = LoftQConfig(loftq_bits=4, loftq_iter=5)
503
+ except Exception as e:
504
+ print(f" ⚠ LoftQ unavailable, falling back to gaussian: {e}")
505
+ # else: gaussian default
506
+ except Exception as e:
507
+ print(f" ⚠ LoRA config probe failed: {e}")
508
  print(f" LoRA config: r={LORA_R}, DoRA={lora_kwargs.get('use_dora')}, "
509
  f"RSLoRA={lora_kwargs.get('use_rslora', False)}, "
510
  f"init={lora_kwargs.get('init_lora_weights', 'gaussian')}, "
 
514
  model = get_peft_model(model, lora)
515
  model.print_trainable_parameters()
516
 
517
+ # ── V8: LoRA+ optimizer (research §coding-llm-frontier #3) ──────────────────
518
+ # Hayou et al 2024 (arxiv 2402.12354): the B matrix in LoRA needs a learning
519
+ # rate ~16× higher than A for fastest convergence and a +1-2pp benchmark lift.
520
+ # Free improvement — no extra memory cost. Activated via SUR_LORA_PLUS_RATIO.
521
+ LORA_PLUS_RATIO = float(os.environ.get("SUR_LORA_PLUS_RATIO", "16"))
522
+ LORA_PLUS_OPT = None # set later if available
523
+ if LORA_PLUS_RATIO > 1.0:
524
+ try:
525
+ # peft.optimizers.create_loraplus_optimizer is the canonical helper
526
+ # (peft>=0.13). For older peft we fall back to manual param-group split.
527
+ from peft.optimizers import create_loraplus_optimizer # type: ignore
528
+ import bitsandbytes as bnb_lib
529
+ LORA_PLUS_OPT = create_loraplus_optimizer(
530
+ model=model,
531
+ optimizer_cls=bnb_lib.optim.PagedAdamW8bit,
532
+ lr=float(os.environ.get("LEARNING_RATE", "7e-5")),
533
+ loraplus_lr_ratio=LORA_PLUS_RATIO,
534
+ weight_decay=0.01,
535
+ )
536
+ print(f" LoRA+ optimizer: lr_B/lr_A = {LORA_PLUS_RATIO}x (paged AdamW 8-bit)")
537
+ except Exception as e:
538
+ print(f" ⚠ LoRA+ helper unavailable ({type(e).__name__}: {e}) — manual split")
539
+ try:
540
+ import bitsandbytes as bnb_lib
541
+ param_groups = [
542
+ {"params": [p for n, p in model.named_parameters()
543
+ if "lora_A" in n], "lr": float(os.environ.get("LEARNING_RATE", "7e-5"))},
544
+ {"params": [p for n, p in model.named_parameters()
545
+ if "lora_B" in n], "lr": float(os.environ.get("LEARNING_RATE", "7e-5")) * LORA_PLUS_RATIO},
546
+ ]
547
+ LORA_PLUS_OPT = bnb_lib.optim.PagedAdamW8bit(param_groups, weight_decay=0.01)
548
+ print(f" LoRA+ manual split: lr_B/lr_A = {LORA_PLUS_RATIO}x")
549
+ except Exception as e2:
550
+ print(f" ⚠ LoRA+ manual split also failed ({e2}) — using SFTTrainer default optim")
551
+ LORA_PLUS_OPT = None
552
+ else:
553
+ print(" LoRA+ disabled (SUR_LORA_PLUS_RATIO ≤ 1.0)")
554
+
555
  # ── Format chat template (system + user + assistant) ────────────────────────
556
  def fmt(ex):
557
  msgs = [
 
605
  report_to="none",
606
  )
607
 
608
+ trainer_kwargs = dict(
609
  model=model,
610
  args=sft_cfg,
611
  train_dataset=raw,
612
  tokenizer=tok,
613
  )
614
+ if LORA_PLUS_OPT is not None:
615
+ # Pass tuple (optimizer, lr_scheduler=None) so HF Trainer doesn't rebuild
616
+ trainer_kwargs["optimizers"] = (LORA_PLUS_OPT, None)
617
+
618
+ trainer = SFTTrainer(**trainer_kwargs)
619
 
620
  print()
621
  print("━━━ training start ━━━")
 
624
 
625
  # Final push (in case last save_steps didn't trigger)
626
  trainer.push_to_hub(commit_message=(
627
+ f"Surrogate-1 v1.2-research SFT — base={BASE.split('/')[-1]}, "
628
+ f"r={LORA_R}+DoRA+RSLoRA+{lora_kwargs.get('init_lora_weights','gauss')}, "
629
+ f"LoRA+x{LORA_PLUS_RATIO} NEFTune α=5 seq={SEQ_LEN}, "
630
  f"{len(rows):,} samples Γ— {EPOCHS} epochs (Kaggle T4Γ—2)"))
631
  print("βœ… pushed to", HUB_ID)
632
+
633
+ # ── V8 GRPO Phase-2 hook (scaffold only — disabled by default) ─────────────
634
+ # Research §coding-llm-frontier pick #1: post-SFT GRPO with execution-based
635
+ # rewards is the BIGGEST single lift (+5-9pp LCB v6, +4-7pp HumanEval+).
636
+ # Implementing the RL loop here would require a Python sandbox + unit-test
637
+ # generator + group-of-N rollouts, all of which strain T4×2. Scaffolded but
638
+ # gated behind RUN_GRPO=1 + TRL>=0.12 + ≥30GB peak VRAM headroom.
639
+ if os.environ.get("RUN_GRPO", "0") == "1":
640
+ try:
641
+ from trl import GRPOTrainer, GRPOConfig # type: ignore
642
+ print("━━━ Phase 2: GRPO with execution rewards (experimental) ━━━")
643
+ # Reward fn: run candidate code in subprocess, +1 if all unit tests
644
+ # pass, 0 otherwise. Group-of-4 rollouts per prompt.
645
+ import re, subprocess, tempfile, signal
646
+ def reward_unit_test_pass(prompts, completions, **kw):
647
+ rewards = []
648
+ for c in completions:
649
+ # Extract first ```python ... ``` block
650
+ m = re.search(r"```python\s*\n(.*?)\n```", c, re.S)
651
+ code = m.group(1) if m else c
652
+ with tempfile.NamedTemporaryFile("w", suffix=".py",
653
+ delete=False) as f:
654
+ f.write(code); pth = f.name
655
+ try:
656
+ rc = subprocess.run(
657
+ ["python", "-c", f"exec(open('{pth}').read())"],
658
+ timeout=8, capture_output=True
659
+ ).returncode
660
+ rewards.append(1.0 if rc == 0 else 0.0)
661
+ except Exception:
662
+ rewards.append(0.0)
663
+ return rewards
664
+ grpo_cfg = GRPOConfig(
665
+ output_dir="./surrogate-1-v1.2-research-grpo",
666
+ num_generations=4, learning_rate=5e-7,
667
+ num_train_epochs=1, per_device_train_batch_size=1,
668
+ gradient_accumulation_steps=8,
669
+ bf16=BF16_OK, fp16=not BF16_OK,
670
+ push_to_hub=True, hub_model_id=HUB_ID + "-grpo",
671
+ hub_token=os.environ.get("HF_TOKEN"),
672
+ )
673
+ grpo = GRPOTrainer(
674
+ model=model, args=grpo_cfg,
675
+ reward_funcs=[reward_unit_test_pass],
676
+ train_dataset=raw,
677
+ )
678
+ grpo.train()
679
+ grpo.push_to_hub(commit_message=f"Surrogate-1 v1.2-research GRPO Phase-2")
680
+ print("✅ GRPO Phase-2 done")
681
+ except ImportError as e:
682
+ print(f" GRPO scaffold skipped — TRL too old: {e}")
683
+ except Exception as e:
684
+ print(f" ⚠ GRPO Phase-2 failed: {type(e).__name__}: {e}")
685
+ print(" (SFT checkpoint is still saved — GRPO is post-SFT booster)")
686
  PYEOF
687
 
688
  # ── Push notebook to Kaggle (creates if not exists, updates if exists) ─────
bin/v2/autonomous-release.sh ADDED
@@ -0,0 +1,425 @@
1
+ #!/usr/bin/env bash
2
+ # Surrogate-1 — autonomous release daemon.
3
+ #
4
+ # Loop:
5
+ # 1. Recon — pull HN top, GitHub trending, ProductHunt feed, RSS list.
6
+ # 2. Cluster signals into "feature gaps" (frequency Γ— recency Γ— relevance).
7
+ # 3. For each above-threshold gap, ask Surrogate-1 to write a spec.md
8
+ # (problem / user-stories / acceptance / impact / out-of-scope).
9
+ # 4. Ask Surrogate-1 for an implementation patch + tests.
10
+ # 5. CISC self-consistency: generate 3 patch candidates, take the one
11
+ # that passes verifier-ensemble + has highest test pass rate.
12
+ # 6. If verdict ok → open a draft PR in target repo, run CI in canary
13
+ # branch with metric-gated promotion (Flagger-style if available).
14
+ # 7. Auto-rollback on SLO violation; auto-promote if green for COOLDOWN.
15
+ # 8. Outcome → outcomes.jsonl for self-improve.
16
+ #
17
+ # Owner-controlled scope (only repos this daemon may touch):
18
+ # AUTO_RELEASE_REPOS env (space-separated, matching the array expansion below), default = axentx/surrogate-1
19
+ #
20
+ # Hard guards:
21
+ # - Never push to main; always open PR (draft) on an auto/* branch
22
+ # - Diff ≤ 600 lines, ≤ 12 files, must include tests
23
+ # - All HardGuards from verifier-ensemble.py apply
24
+ # - PR labeled "autonomous-release" + linked to outcome record id
25
+ #
26
+ # Usage:
27
+ # nohup bash bin/v2/autonomous-release.sh \
28
+ # > $HOME/.surrogate/logs/autonomous-release.log 2>&1 &
29
+ #
30
+ # Cron once per hour:
31
+ # 0 * * * * bash $HOME/.surrogate/hf-space/bin/v2/autonomous-release.sh --once
32
+ set -uo pipefail
33
+ [[ -f "$HOME/.hermes/.env" ]] && { set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a; }
34
+
35
+ HFB="$HOME/.surrogate/hf-space/bin/v2"
36
+ STATE="$HOME/.surrogate/state"
37
+ SPECS="$STATE/specs"
38
+ LOG="$HOME/.surrogate/logs/autonomous-release.log"
39
+ mkdir -p "$STATE" "$SPECS" "$(dirname "$LOG")"
40
+
41
+ ONCE=0
42
+ [[ "${1:-}" == "--once" ]] && ONCE=1
43
+
44
+ INTERVAL_SEC="${REL_INTERVAL_SEC:-3600}" # 1 h
45
+ SPACE="${REL_SPACE:-surrogate1/surrogate-1-zero-gpu}"
46
+ REPOS=(${AUTO_RELEASE_REPOS:-axentx/surrogate-1})
47
+ RECON_LIMIT="${REL_RECON_LIMIT:-50}"
48
+ CISC_N="${REL_CISC_N:-3}"
49
+ GAP_FREQ_THRESHOLD="${REL_GAP_FREQ:-3}" # signal must appear in ≥3 sources
50
+
51
+ log() { echo "[$(date '+%Y-%m-%dT%H:%M:%S')] $*" | tee -a "$LOG"; }
52
+ notify() {
53
+ [[ -z "${DISCORD_WEBHOOK:-}" ]] && return
54
+ curl -s -X POST -H "Content-Type: application/json" \
55
+ -d "{\"content\":\"🚀 autonomous-release: $1\"}" \
56
+ "$DISCORD_WEBHOOK" >/dev/null 2>&1 || true
57
+ }
58
+
59
+ # ── Recon: pull signals from public sources ─────────────────────────────────
60
+ recon() {
61
+ local out="$1"
62
+ log " recon → $out"
63
+ : > "$out"
64
+
65
+ # HN top stories — Algolia public API, no auth needed
66
+ curl -fsS --max-time 15 \
67
+ "https://hn.algolia.com/api/v1/search?tags=story&numericFilters=points>50&hitsPerPage=$RECON_LIMIT" \
68
+ 2>/dev/null | python3 -c "
69
+ import json, sys
70
+ try: d = json.load(sys.stdin)
71
+ except: sys.exit(0)
72
+ for h in d.get('hits', []):
73
+ print(json.dumps({'src':'hn','title':h.get('title',''),'url':h.get('url',''),
74
+ 'score':h.get('points',0),'ts':h.get('created_at','')}))
75
+ " 2>/dev/null >> "$out"
76
+
77
+ # GitHub trending — no official API, scrape via /trending
78
+ curl -fsS --max-time 20 \
79
+ "https://github.com/trending?since=daily&spoken_language_code=en" 2>/dev/null \
80
+ | python3 -c "
81
+ import sys, re, json
82
+ html = sys.stdin.read()
83
+ # very light extractor — avoid pulling beautifulsoup just for this
84
+ for m in re.finditer(r'<h2 class=\"h3 lh-condensed\">\s*<a href=\"([^\"]+)\"', html):
85
+ repo = m.group(1).lstrip('/')
86
+ print(json.dumps({'src':'gh-trending','title':repo,'url':'https://github.com/'+repo,
87
+ 'score':1,'ts':''}))
88
+ " 2>/dev/null | head -n 30 >> "$out"
89
+
90
+ # ProductHunt — public RSS-ish endpoint
91
+ curl -fsS --max-time 15 \
92
+ "https://www.producthunt.com/feed" 2>/dev/null \
93
+ | python3 -c "
94
+ import sys, re, json
95
+ xml = sys.stdin.read()
96
+ for m in list(re.finditer(r'<title>([^<]+)</title>\s*<link>([^<]+)</link>', xml))[:30]:
97
+ print(json.dumps({'src':'producthunt','title':m.group(1),'url':m.group(2),'score':1,'ts':''}))
98
+ " 2>/dev/null >> "$out" || true
99
+
100
+ local n; n=$(wc -l < "$out" | tr -d ' ')
101
+ log " collected $n signals"
102
+ }
103
+
104
+ # ── Gap analysis: cluster signals by keyword overlap ────────────────────────
105
+ gap_analysis() {
106
+ local recon_in="$1" gaps_out="$2"
107
+ python3 - <<PYEOF
108
+ import json, re, collections
109
+ from pathlib import Path
110
+
111
+ # Owner-relevant keywords — bias the funnel toward what Surrogate-1 cares about
112
+ OWNER_KW = {
113
+ "agent","agentic","autonomous","llm","fine-tune","lora","peft",
114
+ "dpo","grpo","rlhf","rlaif","sft","quantization","bitsandbytes",
115
+ "vllm","sglang","tgi","inference","kubernetes","k8s","helm",
116
+ "terraform","cloudformation","aws","prowler","cspm","sre",
117
+ "incident","oncall","postmortem","observability","prometheus",
118
+ "opentelemetry","loki","grafana","argo","gitops","cicd",
119
+ "security","cve","cwe","sbom","slsa","supply-chain","gitleaks",
120
+ "semgrep","sast","dast","mcp","computer-use","tool-use","agent-bench"
121
+ }
122
+ sigs = []
123
+ for L in open("$recon_in"):
124
+ try: sigs.append(json.loads(L))
125
+ except: pass
126
+
127
+ # tokenize titles, score by owner-kw overlap
128
+ def toks(s):
129
+ return set(t.lower() for t in re.findall(r"[a-zA-Z][a-zA-Z0-9-]+", s or ""))
130
+
131
+ clusters = collections.defaultdict(list)
132
+ for s in sigs:
133
+ t = toks(s.get("title", ""))
134
+ overlap = t & OWNER_KW
135
+ if not overlap:
136
+ continue
137
+ # bucket by sorted overlap as cluster key
138
+ key = "+".join(sorted(overlap)[:3])
139
+ clusters[key].append(s)
140
+
141
+ gaps = []
142
+ for key, items in clusters.items():
143
+ n_sources = len({i["src"] for i in items})
144
+ if n_sources >= $GAP_FREQ_THRESHOLD or len(items) >= 5:
145
+ gaps.append({
146
+ "topic": key,
147
+ "n_signals": len(items),
148
+ "n_sources": n_sources,
149
+ "examples": [{"title": i["title"], "url": i["url"]} for i in items[:5]],
150
+ })
151
+
152
+ gaps.sort(key=lambda g: (g["n_sources"], g["n_signals"]), reverse=True)
153
+ gaps = gaps[:5] # cap at top 5 per cycle
154
+
155
+ with open("$gaps_out", "w") as f:
156
+ json.dump(gaps, f, indent=2)
157
+
158
+ print(f" → {len(gaps)} gaps identified")
159
+ PYEOF
160
+ }
161
+
162
+ # ── Build spec.md from a gap ────────────────────────────────────────────────
163
+ build_spec() {
164
+ local gap_json="$1" spec_out="$2"
165
+ local work; work=$(mktemp -d)
166
+ cat > "$work/prompt.md" <<EOF
167
+ You are Surrogate-1 in autonomous-release mode. A market signal cluster has
168
+ crossed threshold. Synthesize a Spec-Driven-Development spec for ONE
169
+ feature Surrogate-1 itself should ship — must be a small self-improvement
170
+ to the Surrogate-1 platform (training scripts, daemons, evals, dataset
171
+ quality tooling, etc.). Out of scope: external customer features, anything
172
+ needing payment/PII/user data.
173
+
174
+ Signal cluster:
175
+ \`\`\`json
176
+ $(cat "$gap_json")
177
+ \`\`\`
178
+
179
+ Owner constraints:
180
+ - Diff target ≤600 lines / ≤12 files
181
+ - Must include tests
182
+ - Must benefit at least one of: HumanEval+/MBPP+/LCB v6/SWE-Bench/BFCL/axentx-eval-50
183
+ OR the autonomous-{sre,release,improve} daemons.
184
+ - Must be reversible (rollback step required)
185
+
186
+ Output ONLY this JSON schema:
187
+ {
188
+ "title": "<3-7 word feature name>",
189
+ "problem": "<paragraph: what's missing today>",
190
+ "user_stories": ["As Surrogate-1, I want X so that Y", ...],
191
+ "acceptance_criteria": ["Bench score Z improves by ≥N%", ...],
192
+ "impact": "<expected metric uplift, citable>",
193
+ "competitors_observed": "<who is doing this elsewhere — from signal cluster>",
194
+ "out_of_scope": ["...","..."],
195
+ "rollout_plan": "<canary → promote, with SLO gate>",
196
+ "confidence": 0.0-1.0
197
+ }
198
+ EOF
199
+ python3 "$HFB/surrogate-call.py" --space "$SPACE" \
200
+ --prompt-file "$work/prompt.md" --schema spec \
201
+ --max-tokens 1500 --temperature 0.3 --out "$spec_out"
202
+ local rc=$?
203
+ rm -rf "$work"
204
+ return $rc
205
+ }
206
+
207
+ # ── Build patch candidates with CISC self-consistency ───────────────────────
208
+ build_patch_cisc() {
209
+ local spec_path="$1" out_dir="$2"
210
+ mkdir -p "$out_dir"
211
+ local prompt; prompt=$(mktemp)
212
+ cat > "$prompt" <<EOF
213
+ You are Surrogate-1. Implement the following spec. Produce a unified diff
214
+ + test file. Diff must apply cleanly via \`patch -p1\`.
215
+
216
+ Spec:
217
+ \`\`\`json
218
+ $(cat "$spec_path")
219
+ \`\`\`
220
+
221
+ Hard rules:
222
+ - Modify only files under \$HOME/.surrogate/hf-space/ or under axentx
223
+ repos cloned into \$HOME/develope/.
224
+ - Include or extend tests under tests/v2/ matching the changed file.
225
+ - No new top-level dependency without justification in the diff.
226
+ - Diff under 600 lines / 12 files.
227
+
228
+ Output ONLY this JSON schema:
229
+ {
230
+ "target_file": "<primary file path>",
231
+ "kind": "code"|"iac"|"shell",
232
+ "patch": "<unified diff text>",
233
+ "test_plan": "<commands to verify post-apply>",
234
+ "rollback": "<git revert <sha> or patch -R>",
235
+ "confidence": 0.0-1.0
236
+ }
237
+ EOF
238
+ for i in $(seq 1 $CISC_N); do
239
+ log " CISC candidate $i/$CISC_N"
240
+ # vary temperature for diversity
241
+ local T; T=$(python3 -c "print(round(0.2 + 0.15*$i, 2))")
242
+ python3 "$HFB/surrogate-call.py" --space "$SPACE" \
243
+ --prompt-file "$prompt" --schema patch \
244
+ --max-tokens 2000 --temperature "$T" \
245
+ --out "$out_dir/cand-$i.json" 2>>"$LOG" || \
246
+ log " cand-$i failed (continuing)"
247
+ done
248
+ rm -f "$prompt"
249
+ ls "$out_dir"/cand-*.json 2>/dev/null | wc -l | tr -d ' '
250
+ }
251
+
252
+ # ── Vote: pick best candidate by verifier verdict + confidence ──────────────
253
+ pick_winner() {
254
+ local cand_dir="$1" winner_out="$2"
255
+ local best="" best_score=-1
256
+ for c in "$cand_dir"/cand-*.json; do
257
+ [[ -f "$c" ]] || continue
258
+ local target patch kind conf
259
+ target=$(python3 -c "import json; print(json.load(open('$c')).get('target_file',''))")
260
+ kind=$(python3 -c "import json; print(json.load(open('$c')).get('kind','code'))")
261
+ conf=$(python3 -c "import json; print(json.load(open('$c')).get('confidence',0))")
262
+ python3 -c "import json,sys; sys.stdout.write(json.load(open('$c')).get('patch',''))" > "$cand_dir/$(basename "$c" .json).patch"
263
+
264
+ local verdict_path="$cand_dir/$(basename "$c" .json).verdict.json"
265
+ python3 "$HFB/verifier-ensemble.py" \
266
+ --change "$cand_dir/$(basename "$c" .json).patch" \
267
+ --target "$target" --kind "$kind" --confidence "$conf" \
268
+ --out "$verdict_path" >/dev/null 2>&1 || true
269
+
270
+ local ok npass
271
+ ok=$(python3 -c "import json; print(json.load(open('$verdict_path')).get('ok',False))" 2>/dev/null || echo False)
272
+ npass=$(python3 -c "import json; print(json.load(open('$verdict_path')).get('n_pass',0))" 2>/dev/null || echo 0)
273
+ local score
274
+ score=$(python3 -c "print(int($npass) + (10 if '$ok'=='True' else 0) + float($conf))")
275
+ log " cand=$(basename "$c") ok=$ok pass=$npass conf=$conf → score=$score"
276
+ if (( $(python3 -c "print(1 if $score > $best_score else 0)") )); then
277
+ best="$c"; best_score=$score
278
+ fi
279
+ done
280
+ if [[ -n "$best" ]]; then
281
+ cp "$best" "$winner_out"
282
+ cp "$cand_dir/$(basename "$best" .json).verdict.json" "${winner_out%.json}.verdict.json"
283
+ log " winner=$(basename "$best") score=$best_score"
284
+ return 0
285
+ fi
286
+ return 1
287
+ }
288
+
289
+ # ── Sweep ───────────────────────────────────────────────────────────────────
290
+ sweep() {
291
+ local ts; ts=$(date -u +%Y%m%dT%H%M%SZ)
292
+ local cycle="$STATE/release-$ts"
293
+ mkdir -p "$cycle"
294
+ log "═══ release sweep $ts ═══"
295
+
296
+ recon "$cycle/recon.jsonl"
297
+ gap_analysis "$cycle/recon.jsonl" "$cycle/gaps.json"
298
+
299
+ local n_gaps
300
+ n_gaps=$(python3 -c "import json; print(len(json.load(open('$cycle/gaps.json'))))")
301
+ if (( n_gaps == 0 )); then
302
+ log " no gaps above threshold — skipping cycle"
303
+ return 0
304
+ fi
305
+
306
+ # Process top gap only this cycle (avoid PR flood)
307
+ python3 -c "
308
+ import json
309
+ g = json.load(open('$cycle/gaps.json'))[0]
310
+ json.dump(g, open('$cycle/top-gap.json', 'w'))
311
+ print(g['topic'])
312
+ " | while read -r topic; do
313
+ log " top gap: $topic"
314
+ local spec_path="$cycle/spec.json"
315
+ if ! build_spec "$cycle/top-gap.json" "$spec_path"; then
316
+ log " spec build failed — skipping"
317
+ continue
318
+ fi
319
+ local title
320
+ title=$(python3 -c "import json; print(json.load(open('$spec_path')).get('title','untitled'))")
321
+ log " spec: $title"
322
+
323
+ local cand_dir="$cycle/candidates"
324
+ local n_cand
325
+ n_cand=$(build_patch_cisc "$spec_path" "$cand_dir" | tail -n 1)  # last stdout line is the count; earlier lines are tee'd log output
326
+ log " built $n_cand patch candidates (target $CISC_N)"
327
+ if (( n_cand == 0 )); then
328
+ python3 "$HFB/outcome-log.py" --daemon release --trigger "gap:$topic" \
329
+ --anomaly "$cycle/top-gap.json" --response "$spec_path" \
330
+ --applied false --outcome error \
331
+ --lesson "no patch candidates produced" || true
332
+ continue
333
+ fi
334
+
335
+ if pick_winner "$cand_dir" "$cycle/winner.json"; then
336
+ local ok
337
+ ok=$(python3 -c "import json; print(json.load(open('$cycle/winner.verdict.json')).get('ok',False))")
338
+ if [[ "$ok" == "True" ]]; then
339
+ log " → opening draft PR"
340
+ open_draft_pr "$cycle"
341
+ else
342
+ log " winner failed verifier — queueing"
343
+ python3 "$HFB/outcome-log.py" --daemon release --trigger "gap:$topic" \
344
+ --anomaly "$cycle/top-gap.json" --response "$spec_path" \
345
+ --verdict "$cycle/winner.verdict.json" \
346
+ --applied false --outcome queued \
347
+ --lesson "best candidate still failed verifier" || true
348
+ fi
349
+ else
350
+ log " no winner — all candidates failed"
351
+ fi
352
+ done
353
+
354
+ log "═══ sweep done ═══"
355
+ }
356
+
357
+ # ── Open draft PR (gh CLI required) ─────────────────────────────────────────
358
+ open_draft_pr() {
359
+ local cycle="$1"
360
+ if ! command -v gh >/dev/null 2>&1; then
361
+ log " gh CLI missing — queueing instead of PR"
362
+ python3 "$HFB/outcome-log.py" --daemon release \
363
+ --trigger "release_cycle" \
364
+ --response "$cycle/winner.json" \
365
+ --applied false --outcome queued \
366
+ --lesson "gh CLI not installed" || true
367
+ return 1
368
+ fi
369
+
370
+ local target_repo="${REPOS[0]}"
371
+ local target_file patch_file branch
372
+ target_file=$(python3 -c "import json; print(json.load(open('$cycle/winner.json'))['target_file'])")
373
+ patch_file=$(ls "$cycle/candidates"/*.patch 2>/dev/null | head -1)  # patches live under candidates/
374
+ branch="auto/release-$(date -u +%Y%m%d-%H%M)"
375
+
376
+ # Clone if not present
377
+ local clone_dir="$STATE/repos/$(basename "$target_repo")"
378
+ if [[ ! -d "$clone_dir/.git" ]]; then
379
+ gh repo clone "$target_repo" "$clone_dir" 2>>"$LOG" || {
380
+ log " clone failed for $target_repo"
381
+ return 1
382
+ }
383
+ fi
384
+
385
+ ( cd "$clone_dir"
386
+ git fetch origin main 2>>"$LOG"
387
+ git checkout -B "$branch" origin/main 2>>"$LOG"
388
+ patch -p1 < "$patch_file" 2>>"$LOG" || { log " patch apply failed"; exit 1; }
389
+ git add -A
390
+ git commit -m "auto-release: $(python3 -c "import json; print(json.load(open('$cycle/winner.json')).get('target_file',''))")
391
+ auto-generated by autonomous-release.sh
392
+ spec=$cycle/spec.json
393
+ verdict=$cycle/winner.verdict.json"
394
+ git push -u origin "$branch" 2>>"$LOG"
395
+ gh pr create --draft --title "[auto-release] $(python3 -c "import json; print(json.load(open('$cycle/spec.json')).get('title',''))")" \
396
+ --body "Autonomous release.
397
+
398
+ **Spec**: see \`$cycle/spec.json\`
399
+ **Verdict**: see \`$cycle/winner.verdict.json\`
400
+
401
+ This PR was generated by the Surrogate-1 autonomous-release daemon. It is a DRAFT — promote to ready-for-review only after CI passes and a human eyeballs the diff." \
402
+ --label "autonomous-release" 2>&1 | tee -a "$LOG"
403
+ ) || true
404
+
405
+ python3 "$HFB/outcome-log.py" --daemon release \
406
+ --trigger "release_cycle" \
407
+ --anomaly "$cycle/top-gap.json" \
408
+ --response "$cycle/winner.json" \
409
+ --verdict "$cycle/winner.verdict.json" \
410
+ --applied true --outcome success \
411
+ --lesson "draft PR opened on $branch" || true
412
+ notify "draft PR opened on $target_repo / $branch"
413
+ }
414
+
415
+ if (( ONCE )); then
416
+ sweep
417
+ exit 0
418
+ fi
419
+
420
+ log "═══ autonomous-release starting (interval=${INTERVAL_SEC}s) ═══"
421
+ notify "online — interval ${INTERVAL_SEC}s"
422
+ while true; do
423
+ sweep
424
+ sleep "$INTERVAL_SEC"
425
+ done
bin/v2/autonomous-sre.sh ADDED
@@ -0,0 +1,346 @@
1
+ #!/usr/bin/env bash
2
+ # Surrogate-1 — autonomous SRE daemon.
3
+ #
4
+ # 24×7 monitors infra Surrogate-1 itself owns or operates against, and
5
+ # tries to auto-heal incidents. Every candidate action passes through
6
+ # verifier-ensemble.py — anything that fails verification is QUEUED, never
7
+ # applied. The whole turn (anomaly → diagnosis → verdict → apply/queue →
8
+ # metric_after) is logged to outcomes.jsonl so self-improve.sh can build
9
+ # the next round's training data.
10
+ #
11
+ # Probe targets (all read-only by default):
12
+ # 1. HF Spaces health — runtime.stage / errorMessage
13
+ # 2. HF Datasets growth — last commit age (pipeline staleness)
14
+ # 3. ZeroGPU smoke test — small generation request
15
+ # 4. Kaggle kernel state — only if KAGGLE_KEY env is fresh
16
+ # 5. AWS via aws-cli — only if AWS_PROFILE set + Excise stack
17
+ # 6. GH Actions runs — `gh run list` for axentx orgs
18
+ #
19
+ # Auto-fix scope (whitelist of safe actions):
20
+ # - factory_reboot a stuck HF Space
21
+ # - re-trigger a failed GH workflow run
22
+ # - update a Space env var (already supported via swap-zerogpu-lora.sh)
23
+ # - apply a small (<300 line) diff to a file in $HOME/.surrogate/* if
24
+ # verifier-ensemble passes ALL checks
25
+ #
26
+ # Anything else → queued to ~/.surrogate/state/queue/<ts>.json for
27
+ # operator review. Refused-by-policy actions are LOGGED but never queued.
28
+ #
29
+ # Usage (long-lived daemon):
30
+ # nohup bash bin/v2/autonomous-sre.sh \
31
+ # > $HOME/.surrogate/logs/autonomous-sre.log 2>&1 &
32
+ #
33
+ # Or via cron every 5 min:
34
+ # */5 * * * * bash $HOME/.surrogate/hf-space/bin/v2/autonomous-sre.sh --once
35
+ set -uo pipefail
36
+ [[ -f "$HOME/.hermes/.env" ]] && { set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a; }
37
+
38
+ HFB="$HOME/.surrogate/hf-space/bin/v2"
39
+ STATE="$HOME/.surrogate/state"
40
+ QUEUE="$STATE/queue"
41
+ LOG="$HOME/.surrogate/logs/autonomous-sre.log"
42
+ mkdir -p "$STATE" "$QUEUE" "$(dirname "$LOG")"
43
+
44
+ ONCE=0
45
+ [[ "${1:-}" == "--once" ]] && ONCE=1
46
+ INTERVAL_SEC="${SRE_INTERVAL_SEC:-300}" # 5 min between full sweeps
47
+ SPACE_PRIMARY="${SRE_SPACE_PRIMARY:-surrogate1/surrogate-1-zero-gpu}"
48
+ SPACE_SECONDARY="${SRE_SPACE_SECONDARY:-ashirato/surrogate-1-zero-gpu}"
49
+ DATASETS=(${SRE_DATASETS:-axentx/surrogate-1-pairs axentx/surrogate-1-pairs-shard1 axentx/surrogate-1-pairs-shard2 axentx/surrogate-1-pairs-shard3 axentx/surrogate-1-pairs-shard4})
50
+ DATASET_STALE_HOURS="${SRE_DATASET_STALE_H:-3}"
51
+
52
+ log() { echo "[$(date '+%Y-%m-%dT%H:%M:%S')] $*" | tee -a "$LOG"; }
53
+ notify() {
54
+ [[ -z "${DISCORD_WEBHOOK:-}" ]] && return
55
+ curl -s -X POST -H "Content-Type: application/json" \
56
+ -d "{\"content\":\"🛡️ autonomous-sre: $1\"}" \
57
+ "$DISCORD_WEBHOOK" >/dev/null 2>&1 || true
58
+ }
59
+
60
+ # ── Single shared call to record an anomaly + decide ────────────────────────
61
+ handle_anomaly() {
62
+ local trigger="$1" anomaly_json="$2"
63
+ local ts; ts=$(date -u +%Y%m%dT%H%M%SZ)
64
+ local work; work=$(mktemp -d "$STATE/sre-$ts-XXXX")
65
+
66
+ echo "$anomaly_json" > "$work/anomaly.json"
67
+
68
+ # Build diagnosis prompt with last 3 outcomes for context
69
+ local recent
70
+ recent=$(tail -n 3 "$STATE/outcomes.jsonl" 2>/dev/null \
71
+ | python3 -c "import sys,json
72
+ for L in sys.stdin:
73
+ try: r = json.loads(L)
74
+ except: continue
75
+ print(f\"- {r['ts']} {r['daemon']}/{r['trigger']} → {r['outcome']}\")
76
+ " 2>/dev/null || echo " (none)")
77
+
78
+ cat > "$work/prompt.md" <<EOF
79
+ You are Surrogate-1 in SRE auto-heal mode. An anomaly has been detected by
80
+ the autonomous-sre daemon. Diagnose root cause and propose ONE specific
81
+ fix, OR explicitly say "fix_kind": "none" if you're <70% confident.
82
+
83
+ Trigger: $trigger
84
+
85
+ Anomaly details (JSON):
86
+ \`\`\`json
87
+ $anomaly_json
88
+ \`\`\`
89
+
90
+ Recent outcomes (last 3):
91
+ $recent
92
+
93
+ Hard constraints:
94
+ - Only propose fixes for systems Surrogate-1 owns: HF Spaces under
95
+ surrogate1/* + ashirato/* + axentx/*, HF datasets under axentx/*,
96
+ GH workflows in axentx repos. Refuse any AWS/prod/customer system.
97
+ # - Diff must be <300 lines, ≤8 files.
98
+ - No destructive operations (rm -rf, DROP, kubectl delete ns, IAM \\*:\\*).
99
+ # - If the fix is "factory_reboot Space X" — set fix_kind=shell with
100
+ patch=\`bash $HFB/swap-zerogpu-lora.sh AXENTX/<lora> ONLY=<name>\` style.
101
+
102
+ Respond ONLY with this JSON schema:
103
+ {
104
+ "diagnosis": "<one-paragraph root cause>",
105
+ "fix_kind": "code" | "iac" | "shell" | "sql" | "none",
106
+ "target_file": "<absolute path, or empty if shell-only>",
107
+ "patch": "<unified diff or shell command>",
108
+ "rollback": "<how to undo>",
109
+ "test_plan": "<how we'll know it worked>",
110
+ "confidence": 0.0-1.0
111
+ }
112
+ EOF
113
+
114
+ log " → calling Surrogate for diagnosis ($trigger)"
115
+ if ! python3 "$HFB/surrogate-call.py" \
116
+ --space "$SPACE_PRIMARY" \
117
+ --prompt-file "$work/prompt.md" \
118
+ --schema diagnosis \
119
+ --max-tokens 1200 --temperature 0.15 \
120
+ --out "$work/response.json" 2>"$work/call.err"; then
121
+ log " ✗ surrogate-call failed: $(cat "$work/call.err" | head -c 200)"
122
+ python3 "$HFB/outcome-log.py" --daemon sre --trigger "$trigger" \
123
+ --anomaly "$work/anomaly.json" --prompt "$work/prompt.md" \
124
+ --applied false --outcome error \
125
+ --lesson "endpoint unavailable for diagnosis" || true
126
+ return 1
127
+ fi
128
+
129
+ local fix_kind conf target patch
130
+ fix_kind=$(python3 -c "import json; print(json.load(open('$work/response.json')).get('fix_kind','none'))")
131
+ conf=$(python3 -c "import json; print(json.load(open('$work/response.json')).get('confidence',0))")
132
+ target=$(python3 -c "import json; print(json.load(open('$work/response.json')).get('target_file','') or '')")
133
+ patch=$(python3 -c "import json,sys; sys.stdout.write(json.load(open('$work/response.json')).get('patch',''))")
134
+
135
+ log " diagnosis: fix_kind=$fix_kind confidence=$conf target=$target"
136
+
137
+ if [[ "$fix_kind" == "none" ]] || [[ -z "$patch" ]]; then
138
+ log " Surrogate declined to act ($fix_kind / empty patch) — recording, no apply"
139
+ python3 "$HFB/outcome-log.py" --daemon sre --trigger "$trigger" \
140
+ --anomaly "$work/anomaly.json" --prompt "$work/prompt.md" \
141
+ --response "$work/response.json" --applied false --outcome rejected \
142
+ --lesson "model declined low-confidence" || true
143
+ return 0
144
+ fi
145
+
146
+ # Write patch to file for verifier
147
+ echo "$patch" > "$work/patch.txt"
148
+ [[ -z "$target" ]] && target="$work/patch.txt"
149
+
150
+ # Idempotency check — if same patch was applied <4 h ago, skip
151
+ if python3 "$HFB/idempotency.py" check --plan "$work/patch.txt" --ttl-hours 4 \
152
+ >"$work/idem.json" 2>/dev/null; then
153
+ log " idempotent skip — same patch applied recently"
154
+ python3 "$HFB/outcome-log.py" --daemon sre --trigger "$trigger" \
155
+ --anomaly "$work/anomaly.json" --response "$work/response.json" \
156
+ --applied false --outcome rejected \
157
+ --lesson "idempotent: $(python3 -c "import json; print(json.load(open('$work/idem.json'))['key'][:12])")" || true
158
+ return 0
159
+ fi
160
+
161
+ log " → verifier-ensemble"
162
+ local vrc=0
163
+ python3 "$HFB/verifier-ensemble.py" \
164
+ --change "$work/patch.txt" --target "$target" --kind "$fix_kind" \
165
+ --confidence "$conf" --out "$work/verdict.json" >/dev/null || vrc=$?
166
+
167
+ local verdict_ok
168
+ verdict_ok=$(python3 -c "import json; print(json.load(open('$work/verdict.json')).get('ok',False))")
169
+
170
+ if [[ "$verdict_ok" != "True" ]]; then
171
+ log " verdict: REJECTED → queueing for review"
172
+ cp -r "$work" "$QUEUE/$(basename "$work")"
173
+ python3 "$HFB/outcome-log.py" --daemon sre --trigger "$trigger" \
174
+ --anomaly "$work/anomaly.json" --prompt "$work/prompt.md" \
175
+ --response "$work/response.json" --verdict "$work/verdict.json" \
176
+ --applied false --outcome queued \
177
+ --lesson "verifier rejected β€” manual review" || true
178
+ notify "queued $trigger ($(python3 -c "import json; print(', '.join(json.load(open('$work/verdict.json')).get('reasons',[])[:2]))"))"
179
+ return 0
180
+ fi
181
+
182
+ log " verdict: SAFE → applying"
183
+ local apply_rc=0
184
+ if [[ "$fix_kind" == "shell" ]]; then
185
+ bash -c "$patch" 2>&1 | tee "$work/apply.log" || apply_rc=$?
186
+ elif [[ "$fix_kind" == "code" || "$fix_kind" == "iac" ]] && [[ -f "$target" ]]; then
187
+ # apply unified diff
188
+ ( cd "$(dirname "$target")" && patch -p1 --dry-run < "$work/patch.txt" \
189
+ && patch -p1 < "$work/patch.txt" ) 2>&1 | tee "$work/apply.log" \
190
+ || apply_rc=$?
191
+ else
192
+ apply_rc=99
193
+ echo "no apply path for fix_kind=$fix_kind target=$target" > "$work/apply.log"
194
+ fi
195
+
196
+ if [[ $apply_rc -eq 0 ]]; then
197
+ log " ✓ applied — capturing metric_after"
198
+ sleep 5
199
+ python3 "$HFB/idempotency.py" record --plan "$work/patch.txt" \
200
+ --daemon sre --outcome applied >/dev/null 2>&1 || true
201
+ python3 "$HFB/outcome-log.py" --daemon sre --trigger "$trigger" \
202
+ --anomaly "$work/anomaly.json" --prompt "$work/prompt.md" \
203
+ --response "$work/response.json" --verdict "$work/verdict.json" \
204
+ --applied true --outcome success \
205
+ --lesson "auto-heal worked first try" || true
206
+ notify "auto-healed $trigger (confidence=$conf)"
207
+ else
208
+ log " ✗ apply failed rc=$apply_rc — rolling back"
209
+ # best-effort rollback hint logged but not auto-applied
210
+ python3 "$HFB/outcome-log.py" --daemon sre --trigger "$trigger" \
211
+ --anomaly "$work/anomaly.json" --prompt "$work/prompt.md" \
212
+ --response "$work/response.json" --verdict "$work/verdict.json" \
213
+ --applied true --outcome rollback \
214
+ --lesson "apply rc=$apply_rc; rollback:$(python3 -c "import json; print(json.load(open('$work/response.json')).get('rollback','none')[:80])")" || true
215
+ notify "ROLLBACK $trigger rc=$apply_rc"
216
+ fi
217
+ }
218
+
219
+ # ��─ Probe 1: HF Space health ────────────────────────────────────────────────
220
+ probe_space() {
221
+ local space="$1"
222
+ local resp; resp=$(curl -fsS --max-time 15 \
223
+ ${HF_TOKEN:+-H "Authorization: Bearer $HF_TOKEN"} \
224
+ "https://huggingface.co/api/spaces/$space" 2>/dev/null) || return 0
225
+ local stage err
226
+ stage=$(echo "$resp" | python3 -c "import json,sys; print(json.load(sys.stdin).get('runtime',{}).get('stage','UNKNOWN'))" 2>/dev/null)
227
+ err=$(echo "$resp" | python3 -c "import json,sys; print(json.load(sys.stdin).get('runtime',{}).get('errorMessage','') or '')" 2>/dev/null)
228
+
229
+ case "$stage" in
230
+ RUNNING|BUILDING|CONFIG_ERROR_QUEUED|RUNNING_BUILDING) ;; # nominal/expected
231
+ STOPPED|RUNTIME_ERROR|BUILD_ERROR|NO_APP_FILE|*ERROR*)
232
+ log " ⚠ Space $space stage=$stage err=$err"
233
+ handle_anomaly "hf_space_${stage,,}" \
234
+ "$(printf '{"space":"%s","stage":"%s","error":%s}' \
235
+ "$space" "$stage" "$(python3 -c "import json; print(json.dumps('$err'))")")"
236
+ ;;
237
+ *) log " Space $space stage=$stage (no action)" ;;
238
+ esac
239
+ }
240
+
241
+ # ── Probe 2: dataset growth (staleness) ─────────────────────────────────────
242
+ probe_dataset_staleness() {
243
+ local ds="$1"
244
+ local resp; resp=$(curl -fsS --max-time 15 \
245
+ ${HF_TOKEN:+-H "Authorization: Bearer $HF_TOKEN"} \
246
+ "https://huggingface.co/api/datasets/$ds" 2>/dev/null) || return 0
247
+ local last_modified
248
+ last_modified=$(echo "$resp" | python3 -c "
249
+ import json, sys, datetime
250
+ try:
251
+ d = json.load(sys.stdin)
252
+ lm = d.get('lastModified') or d.get('createdAt')
253
+ print(lm or '')
254
+ except: print('')
255
+ " 2>/dev/null)
256
+ [[ -z "$last_modified" ]] && return 0
257
+ local age_h
258
+ age_h=$(python3 -c "
259
+ import datetime
260
+ lm = datetime.datetime.fromisoformat('${last_modified}'.replace('Z','+00:00'))
261
+ now = datetime.datetime.now(datetime.timezone.utc)
262
+ print(int((now - lm).total_seconds() / 3600))
263
+ " 2>/dev/null || echo 0)
264
+ if (( age_h > DATASET_STALE_HOURS )); then
265
+ log " ⚠ dataset $ds stale ${age_h}h (threshold ${DATASET_STALE_HOURS}h)"
266
+ handle_anomaly "hf_dataset_stale" \
267
+ "$(printf '{"dataset":"%s","age_hours":%d,"threshold":%d}' \
268
+ "$ds" "$age_h" "$DATASET_STALE_HOURS")"
269
+ fi
270
+ }
271
+
272
+ # ── Probe 3: ZeroGPU smoke (cheapest health signal) ─────────────────────────
273
+ probe_zerogpu_smoke() {
274
+ local space="$1"
275
+ local url="https://${space//\//-}.hf.space/api/predict"
276
+ if ! curl -fsS --max-time 30 -X POST -H "Content-Type: application/json" \
277
+ -d '{"data":["ping","hi",16,0.1]}' "$url" >/dev/null 2>&1; then
278
+ log " ⚠ ZeroGPU smoke FAILED on $space"
279
+ handle_anomaly "zerogpu_smoke_fail" \
280
+ "$(printf '{"space":"%s","url":"%s"}' "$space" "$url")"
281
+ fi
282
+ }
283
+
284
+ # ── Probe 4: GH Actions failures (best-effort) ──────────────────────────────
285
+ probe_gh_actions() {
286
+ if ! command -v gh >/dev/null 2>&1; then return 0; fi
287
+ for repo in axentx/arkashira axentx/midnightcrisis; do
288
+ local failed
289
+ failed=$(gh run list --repo "$repo" --limit 5 --json status,conclusion,name \
290
+ 2>/dev/null | python3 -c "
291
+ import json, sys
292
+ try: runs = json.load(sys.stdin)
293
+ except: runs = []
294
+ fails = [r for r in runs if r.get('conclusion') == 'failure']
295
+ print(len(fails))
296
+ " 2>/dev/null || echo 0)
297
+ if (( failed >= 2 )); then
298
+ log " ⚠ GH $repo: $failed of last 5 runs failed"
299
+ handle_anomaly "gh_workflow_repeated_failure" \
300
+ "$(printf '{"repo":"%s","failed_of_5":%d}' "$repo" "$failed")"
301
+ fi
302
+ done
303
+ }
304
+
305
+ # ── Probe 5: outcome log self-consistency (meta) ────────────────────────────
306
+ probe_outcome_log_health() {
307
+ if [[ ! -f "$STATE/outcomes.jsonl" ]]; then return 0; fi
308
+ local n_recent_fail
309
+ n_recent_fail=$(tail -n 20 "$STATE/outcomes.jsonl" 2>/dev/null | python3 -c "
310
+ import sys, json
311
+ n = 0
312
+ for L in sys.stdin:
313
+ try: r = json.loads(L)
314
+ except: continue
315
+ if r.get('outcome') in ('rollback','error'): n += 1
316
+ print(n)
317
+ " 2>/dev/null || echo 0)
318
+ if (( n_recent_fail >= 5 )); then
319
+ log " ⚠ ${n_recent_fail}/20 recent outcomes failed → degrading mode"
320
+ notify "degrading: $n_recent_fail/20 recent fails — operator review"
321
+ fi
322
+ }
323
+
324
+ # ── Sweep ───────────────────────────────────────────────────────────────────
325
+ sweep() {
326
+ log "═══ SRE sweep ═══"
327
+ probe_space "$SPACE_PRIMARY"
328
+ probe_space "$SPACE_SECONDARY"
329
+ for ds in "${DATASETS[@]}"; do probe_dataset_staleness "$ds"; done
330
+ probe_zerogpu_smoke "$SPACE_PRIMARY"
331
+ probe_gh_actions
332
+ probe_outcome_log_health
333
+ log "═══ sweep done ═══"
334
+ }
335
+
336
+ if (( ONCE )); then
337
+ sweep
338
+ exit 0
339
+ fi
340
+
341
+ log "═══ autonomous-sre starting (interval=${INTERVAL_SEC}s) ═══"
342
+ notify "online β€” interval ${INTERVAL_SEC}s"
343
+ while true; do
344
+ sweep
345
+ sleep "$INTERVAL_SEC"
346
+ done
bin/v2/bench-v1-vs-v15.sh CHANGED
@@ -37,6 +37,7 @@ MODELS=(
37
  "v1|axentx/surrogate-1-coder-7b-v1|Qwen/Qwen2.5-Coder-7B-Instruct"
38
  "base7B|Qwen/Qwen2.5-Coder-7B-Instruct|"
39
"v1.1-extended|axentx/surrogate-1-7B-v1.1-extended|Qwen/Qwen2.5-Coder-7B-Instruct"
40
  )
41
  # Bench ladder pivoted 2026-05-01 after V4 (32B OOM) + V5 (14B OOM) both
42
  # crashed Kaggle T4×2. Pick 7B as the validation base — fits T4×2 cleanly,
@@ -126,7 +127,7 @@ run_eval() {
126
  SWE_RESOLVED=$(grep -oE "resolved.*[0-9]+\.[0-9]+" "$out/swebench.log" 2>/dev/null | tail -1 | grep -oE "[0-9]+\.[0-9]+" | tail -1)
127
 
128
  # ── 7. axentx-eval-50 (custom in-domain DevSecOps eval) ──
129
- log " [7/7] axentx-eval-50 (custom DevSecOps)"
130
  if [[ -f "$HOME/.surrogate/hf-space/bin/v2/axentx-eval-50.py" ]]; then
131
  python3 "$HOME/.surrogate/hf-space/bin/v2/axentx-eval-50.py" \
132
  --model "$mdl" --out "$out/axentx-eval" 2>&1 | tee -a "$out/axentx-eval.log" | tail -30
@@ -135,6 +136,26 @@ run_eval() {
135
  AXENTX_SCORE="--"
136
  fi
137

138
  # Persist scores
139
  python3 - <<PYEOF
140
  import json
@@ -147,6 +168,8 @@ data["$label"] = {
147
  "ruler_16k_avg": "${RULER_AVG:-?}",
148
  "swebench_verified_lite100": "${SWE_RESOLVED:-?}",
149
  "axentx_eval_50": "${AXENTX_SCORE:-?}",
150
  }
151
  json.dump(data, open("$SUMMARY_JSON", "w"), indent=2)
152
  PYEOF
 
37
  "v1|axentx/surrogate-1-coder-7b-v1|Qwen/Qwen2.5-Coder-7B-Instruct"
38
  "base7B|Qwen/Qwen2.5-Coder-7B-Instruct|"
39
  "v1.1-extended|axentx/surrogate-1-7B-v1.1-extended|Qwen/Qwen2.5-Coder-7B-Instruct"
40
+ "v1.2-research|axentx/surrogate-1-7B-v1.2-research|Qwen/Qwen2.5-Coder-7B-Instruct"
41
  )
42
  # Bench ladder pivoted 2026-05-01 after V4 (32B OOM) + V5 (14B OOM) both
43
  # crashed Kaggle T4×2. Pick 7B as the validation base — fits T4×2 cleanly,
 
127
  SWE_RESOLVED=$(grep -oE "resolved.*[0-9]+\.[0-9]+" "$out/swebench.log" 2>/dev/null | tail -1 | grep -oE "[0-9]+\.[0-9]+" | tail -1)
128
 
129
  # ── 7. axentx-eval-50 (custom in-domain DevSecOps eval) ──
130
+ log " [7/9] axentx-eval-50 (custom DevSecOps)"
131
  if [[ -f "$HOME/.surrogate/hf-space/bin/v2/axentx-eval-50.py" ]]; then
132
  python3 "$HOME/.surrogate/hf-space/bin/v2/axentx-eval-50.py" \
133
  --model "$mdl" --out "$out/axentx-eval" 2>&1 | tee -a "$out/axentx-eval.log" | tail -30
 
136
  AXENTX_SCORE="--"
137
  fi
138
 
139
+ # ── 8. Multi-IaC-Eval (NEW V8) — CFN+TF+CDK pass-rate w/ cfn-guard +tfsec ─
140
+ log " [8/9] Multi-IaC-Eval (CFN/TF/CDK)"
141
+ if [[ -f "$HOME/.surrogate/hf-space/bin/v2/multi-iac-eval.py" ]]; then
142
+ python3 "$HOME/.surrogate/hf-space/bin/v2/multi-iac-eval.py" \
143
+ --model "$mdl" --out "$out/multi-iac" 2>&1 | tee -a "$out/multi-iac.log" | tail -30
144
+ MULTI_IAC=$(grep -oE "iac_pass_rate.*[0-9]+\.[0-9]+" "$out/multi-iac.log" | tail -1 | grep -oE "[0-9]+\.[0-9]+" | tail -1)
145
+ else
146
+ MULTI_IAC="--"
147
+ fi
148
+
149
+ # ── 9. ITBench-lite (NEW V8) — 102 K8s SRE/CISO/FinOps scenarios ──
150
+ log " [9/9] ITBench-lite"
151
+ if [[ -f "$HOME/.surrogate/hf-space/bin/v2/itbench-lite.py" ]]; then
152
+ python3 "$HOME/.surrogate/hf-space/bin/v2/itbench-lite.py" \
153
+ --model "$mdl" --out "$out/itbench" 2>&1 | tee -a "$out/itbench.log" | tail -30
154
+ ITBENCH=$(grep -oE "itbench_score.*[0-9]+\.[0-9]+" "$out/itbench.log" | tail -1 | grep -oE "[0-9]+\.[0-9]+" | tail -1)
155
+ else
156
+ ITBENCH="--"
157
+ fi
158
+
159
  # Persist scores
160
  python3 - <<PYEOF
161
  import json
 
168
  "ruler_16k_avg": "${RULER_AVG:-?}",
169
  "swebench_verified_lite100": "${SWE_RESOLVED:-?}",
170
  "axentx_eval_50": "${AXENTX_SCORE:-?}",
171
+ "multi_iac_eval": "${MULTI_IAC:-?}",
172
+ "itbench_lite": "${ITBENCH:-?}",
173
  }
174
  json.dump(data, open("$SUMMARY_JSON", "w"), indent=2)
175
  PYEOF
bin/v2/idempotency.py ADDED
@@ -0,0 +1,118 @@
1
+ #!/usr/bin/env python3
2
+ """Surrogate-1 — idempotency keys (research §autonomous-24x7 pattern 2).
3
+
4
+ Every autonomous action computes idempotency_key = sha256(plan). If the
5
+ same key has been seen within the TTL, the action is treated as already-
6
+ applied and SKIPPED (preventing replay storms when the same anomaly fires
7
+ twice in a row). Records live in a JSONL ledger.
8
+
9
+ Ledger entry:
10
+ {"key":"<sha256>", "ts":"...", "daemon":"sre|release", "outcome":"applied|queued"}
11
+
12
+ Usage:
13
+ # Check if seen recently — exit 0 if seen (skip), 1 if new
14
+ idempotency.py check --plan /path/to/plan.json --ttl-hours 4
15
+
16
+ # Record after applying
17
+ idempotency.py record --plan /path/to/plan.json \
18
+ --daemon sre --outcome applied
19
+ """
20
+ from __future__ import annotations
21
+
22
+ import argparse
23
+ import datetime as dt
24
+ import hashlib
25
+ import json
26
+ import os
27
+ import sys
28
+ from pathlib import Path
29
+
30
+ LEDGER = Path(os.environ.get(
31
+ "SURROGATE_IDEMPOTENCY_LEDGER",
32
+ str(Path.home() / ".surrogate/state/idempotency.jsonl")))
33
+
34
+
35
+ def compute_key(plan_path: Path) -> str:
36
+ txt = plan_path.read_text() if plan_path.is_file() else str(plan_path)
37
+ h = hashlib.sha256()
38
+ h.update(txt.encode())
39
+ return h.hexdigest()
40
+
41
+
42
+ def load_ledger() -> list[dict]:
43
+ if not LEDGER.exists():
44
+ return []
45
+ out = []
46
+ for L in LEDGER.read_text().splitlines():
47
+ try:
48
+ out.append(json.loads(L))
49
+ except Exception:
50
+ continue
51
+ return out
52
+
53
+
54
+ def append_ledger(rec: dict) -> None:
55
+ LEDGER.parent.mkdir(parents=True, exist_ok=True)
56
+ with LEDGER.open("a") as f:
57
+ f.write(json.dumps(rec) + "\n")
58
+
59
+
60
+ def is_recent(key: str, ttl_hours: float) -> bool:
61
+ cutoff = dt.datetime.now(dt.timezone.utc) - dt.timedelta(hours=ttl_hours)
62
+ for r in load_ledger():
63
+ if r.get("key") != key:
64
+ continue
65
+ try:
66
+ ts = dt.datetime.strptime(r["ts"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=dt.timezone.utc)  # make aware so comparing against the aware cutoff doesn't raise TypeError
67
+ except Exception:
68
+ continue
69
+ if ts > cutoff:
70
+ return True
71
+ return False
72
+
73
+
74
+ def main() -> int:
75
+ p = argparse.ArgumentParser()
76
+ sp = p.add_subparsers(dest="cmd", required=True)
77
+
78
+ pc = sp.add_parser("check")
79
+ pc.add_argument("--plan", required=True)
80
+ pc.add_argument("--ttl-hours", type=float, default=4.0)
81
+
82
+ pr = sp.add_parser("record")
83
+ pr.add_argument("--plan", required=True)
84
+ pr.add_argument("--daemon", required=True)
85
+ pr.add_argument("--outcome", required=True)
86
+
87
+ pk = sp.add_parser("key")
88
+ pk.add_argument("--plan", required=True)
89
+
90
+ args = p.parse_args()
91
+
92
+ if args.cmd == "key":
93
+ print(compute_key(Path(args.plan)))
94
+ return 0
95
+
96
+ key = compute_key(Path(args.plan))
97
+
98
+ if args.cmd == "check":
99
+ seen = is_recent(key, args.ttl_hours)
100
+ print(json.dumps({"key": key, "seen_recently": seen,
101
+ "ttl_hours": args.ttl_hours}))
102
+ return 0 if seen else 1 # 0 = seen (skip); 1 = new (proceed)
103
+
104
+ if args.cmd == "record":
105
+ append_ledger({
106
+ "key": key,
107
+ "ts": dt.datetime.now(dt.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
108
+ "daemon": args.daemon,
109
+ "outcome": args.outcome,
110
+ })
111
+ print(f"recorded {key[:12]}…")
112
+ return 0
113
+
114
+ return 2
115
+
116
+
117
+ if __name__ == "__main__":
118
+ sys.exit(main())
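Taken together with the usage block in the docstring, the daemon-side flow is check, then apply, then record. A hedged sketch (paths illustrative; the exit-code contract of 0 = seen recently / 1 = new is the one defined above):

```python
# Dedupe an autonomous action against the idempotency ledger before applying.
import subprocess

IDEM = "bin/v2/idempotency.py"   # adjust to the installed path
plan = "/tmp/plan.json"          # illustrative plan file

# exit 0 = seen within TTL (skip), exit 1 = new (proceed)
seen = subprocess.run(["python3", IDEM, "check", "--plan", plan,
                       "--ttl-hours", "4"]).returncode == 0
if seen:
    print("duplicate plan within TTL, skipping apply")
else:
    # ... apply the verified patch here ...
    subprocess.run(["python3", IDEM, "record", "--plan", plan,
                    "--daemon", "sre", "--outcome", "applied"], check=True)
```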
bin/v2/outcome-log.py ADDED
@@ -0,0 +1,98 @@
1
+ #!/usr/bin/env python3
2
+ """Surrogate-1 β€” outcome logger.
3
+
4
+ All autonomous daemons (autonomous-sre, autonomous-release) call this
5
+ after every action to append a structured record to the outcomes log.
6
+ self-improve.sh reads that log to build the next training round's
7
+ preference + SFT data.
8
+
9
+ One JSONL record per action:
10
+ {
11
+ "ts": "2026-05-01T12:34:56Z",
12
+ "daemon": "sre" | "release",
13
+ "trigger": "...probe_name...",
14
+ "anomaly": {...probe details...},
15
+ "prompt": "<full prompt sent to Surrogate>",
16
+ "response": {...Surrogate's parsed JSON output...},
17
+ "verdict": {...verifier-ensemble JSON...},
18
+ "applied": true|false,
19
+ "outcome": "success" | "rollback" | "queued" | "rejected",
20
+ "metric_after": {...optional post-action observation...},
21
+ "lesson": "optional one-line takeaway"
22
+ }
23
+
24
+ Usage:
25
+ outcome-log.py --daemon sre --trigger hf_space_stage_failed \
26
+ --anomaly /tmp/anomaly.json \
27
+ --prompt /tmp/prompt.md \
28
+ --response /tmp/response.json \
29
+ --verdict /tmp/verdict.json \
30
+ --applied true --outcome success \
31
+ [--lesson "factory_reboot fixed stuck Space"]
32
+ """
33
+ from __future__ import annotations
34
+
35
+ import argparse
36
+ import datetime as dt
37
+ import json
38
+ import os
39
+ import sys
40
+ from pathlib import Path
41
+
42
+ LOG_PATH = Path(os.environ.get(
43
+ "SURROGATE_OUTCOME_LOG",
44
+ str(Path.home() / ".surrogate/state/outcomes.jsonl")))
45
+
46
+
47
+ def _maybe_load(p: str | None) -> object | None:
48
+ if not p:
49
+ return None
50
+ pp = Path(p)
51
+ if not pp.exists():
52
+ return p # treat as inline string
53
+ txt = pp.read_text()
54
+ try:
55
+ return json.loads(txt)
56
+ except Exception:
57
+ return txt # not JSON β†’ store raw
58
+
59
+
60
+ def main() -> int:
61
+ p = argparse.ArgumentParser()
62
+ p.add_argument("--daemon", required=True, choices=["sre", "release", "manual"])
63
+ p.add_argument("--trigger", required=True)
64
+ p.add_argument("--anomaly", default=None,
65
+ help="path to JSON file or inline string")
66
+ p.add_argument("--prompt", default=None)
67
+ p.add_argument("--response", default=None)
68
+ p.add_argument("--verdict", default=None)
69
+ p.add_argument("--applied", choices=["true", "false"], required=True)
70
+ p.add_argument("--outcome", required=True,
71
+ choices=["success", "rollback", "queued", "rejected", "error"])
72
+ p.add_argument("--metric-after", default=None)
73
+ p.add_argument("--lesson", default=None)
74
+ args = p.parse_args()
75
+
76
+ rec = {
77
+ "ts": dt.datetime.now(dt.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
78
+ "daemon": args.daemon,
79
+ "trigger": args.trigger,
80
+ "anomaly": _maybe_load(args.anomaly),
81
+ "prompt": _maybe_load(args.prompt),
82
+ "response": _maybe_load(args.response),
83
+ "verdict": _maybe_load(args.verdict),
84
+ "applied": args.applied == "true",
85
+ "outcome": args.outcome,
86
+ "metric_after": _maybe_load(args.metric_after),
87
+ "lesson": args.lesson,
88
+ }
89
+ LOG_PATH.parent.mkdir(parents=True, exist_ok=True)
90
+ with LOG_PATH.open("a") as f:
91
+ f.write(json.dumps(rec, ensure_ascii=False) + "\n")
92
+ print(f"logged outcome: {args.daemon}/{args.trigger} β†’ {args.outcome} "
93
+ f"(applied={args.applied})", file=sys.stderr)
94
+ return 0
95
+
96
+
97
+ if __name__ == "__main__":
98
+ sys.exit(main())
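Because the log is append-only JSONL, downstream consumers can stream it line by line. A hedged sketch computing the applied-action success rate per daemon (the default path mirrors LOG_PATH above):

```python
# Per-daemon success rate over applied actions in outcomes.jsonl.
import collections
import json
import os
from pathlib import Path

LOG = Path(os.environ.get("SURROGATE_OUTCOME_LOG",
                          str(Path.home() / ".surrogate/state/outcomes.jsonl")))
applied = collections.Counter()
ok = collections.Counter()
lines = LOG.read_text().splitlines() if LOG.exists() else []
for line in lines:
    try:
        r = json.loads(line)
    except ValueError:
        continue
    if r.get("applied"):
        d = r.get("daemon", "?")
        applied[d] += 1
        ok[d] += r.get("outcome") == "success"   # bool counts as 0/1
for d, n in applied.items():
    print(f"{d}: {ok[d]}/{n} applied actions succeeded")
```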
bin/v2/self-improve.sh ADDED
@@ -0,0 +1,283 @@
1
+ #!/usr/bin/env bash
2
+ # Surrogate-1 β€” self-improvement data flywheel.
3
+ #
4
+ # Reads outcomes.jsonl (produced by autonomous-sre.sh + autonomous-release.sh)
5
+ # and converts the success/failure signal into training data for the next
6
+ # round, then triggers a refresh when accumulation crosses thresholds.
7
+ #
8
+ # Pipeline (from research Β§self-improvement.md, cron cadence aligned to
9
+ # research recommendations):
10
+ # 1. Aggregate outcomes since last run.
11
+ # 2. Split into:
12
+ # SUCCESS = applied && outcome=success
13
+ # FAIL = applied && outcome in (rollback, error)
14
+ # REJECTED = !applied β€” verifier blocked it
15
+ # 3. Build 3 datasets:
16
+ # a) SFT replay β€” SUCCESS only, formatted as prompt/response pairs
17
+ # (RLEF-style: model wrote it, executor approved)
18
+ # b) KTO unpaired β€” every outcome with binary "thumbs" label
19
+ # (KTO doesn't need pairs β€” lossless on logs)
20
+ # c) Skill library β€” verified procedures from SUCCESS, indexed by topic
21
+ # 4. Push to HF Hub:
22
+ # axentx/surrogate-1-self-traces (SFT)
23
+ # axentx/surrogate-1-pref-kto (KTO)
24
+ # axentx/surrogate-1-skills (skill library)
25
+ # 5. If SFT pairs β‰₯ SFT_TRIGGER_N OR KTO β‰₯ KTO_TRIGGER_N β†’ kick training:
26
+ # - Bumps Kaggle kernel version (or notifies user to upload)
27
+ # - Logs decision to outcomes.jsonl with daemon=manual trigger=self-improve
28
+ #
29
+ # Cadence (from research):
30
+ # - SFT replay weekly Sun 5am (cheap)
31
+ # - KTO refresh biweekly (1st + 15th)
32
+ # - Skill index daily 4am (free)
33
+ # - Trigger train when thresholds met
34
+ #
35
+ # Usage:
36
+ # bash bin/v2/self-improve.sh # run all stages, idempotent
37
+ # bash bin/v2/self-improve.sh sft # just SFT replay
38
+ # bash bin/v2/self-improve.sh kto # just KTO build
39
+ # bash bin/v2/self-improve.sh skills # just skill library
40
+ # bash bin/v2/self-improve.sh status # report counts only
41
+ set -uo pipefail
42
+ [[ -f "$HOME/.hermes/.env" ]] && { set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a; }
43
+
44
+ HFB="$HOME/.surrogate/hf-space/bin/v2"
45
+ STATE="$HOME/.surrogate/state"
46
+ OUTCOMES="$STATE/outcomes.jsonl"
47
+ WORK="$STATE/self-improve"
48
+ LOG="$HOME/.surrogate/logs/self-improve.log"
49
+ mkdir -p "$WORK" "$(dirname "$LOG")"
50
+
51
+ CMD="${1:-all}"
52
+
53
+ # Trigger thresholds β€” research recommends weekly SFT @ ~$14 H200 cost
54
+ SFT_TRIGGER_N="${SI_SFT_TRIGGER_N:-200}"
55
+ KTO_TRIGGER_N="${SI_KTO_TRIGGER_N:-500}"
56
+
57
+ # HF Hub repos for the three artifact streams
58
+ SFT_REPO="${SI_SFT_REPO:-axentx/surrogate-1-self-traces}"
59
+ KTO_REPO="${SI_KTO_REPO:-axentx/surrogate-1-pref-kto}"
60
+ SKILL_REPO="${SI_SKILL_REPO:-axentx/surrogate-1-skills}"
61
+
62
+ log() { echo "[$(date '+%Y-%m-%dT%H:%M:%S')] $*" | tee -a "$LOG"; }
63
+ notify() {
64
+ [[ -z "${DISCORD_WEBHOOK:-}" ]] && return
65
+ curl -s -X POST -H "Content-Type: application/json" \
66
+ -d "{\"content\":\"♻️ self-improve: $1\"}" \
67
+ "$DISCORD_WEBHOOK" >/dev/null 2>&1 || true
68
+ }
69
+
70
+ # ── Stage: status report ────────────────────────────────────────────────────
71
+ status() {
72
+ if [[ ! -f "$OUTCOMES" ]]; then
73
+ log "no outcomes.jsonl yet β€” daemons haven't logged anything"
74
+ return 0
75
+ fi
76
+ python3 - <<PYEOF
77
+ import json, collections
78
+ from pathlib import Path
79
+ n = collections.Counter()
80
+ by_daemon = collections.Counter()
81
+ trigger = collections.Counter()
82
+ for L in Path("$OUTCOMES").read_text().splitlines():
83
+ try: r = json.loads(L)
84
+ except Exception: continue
85
+ n[r.get("outcome","?")] += 1
86
+ by_daemon[r.get("daemon","?")] += 1
87
+ trigger[r.get("trigger","?")] += 1
88
+ print(f" total outcomes: {sum(n.values())}")
89
+ print(f" by outcome: {dict(n)}")
90
+ print(f" by daemon: {dict(by_daemon)}")
91
+ print(f" top triggers:")
92
+ for t, c in trigger.most_common(8):
93
+ print(f" {c:4d} {t}")
94
+ PYEOF
95
+ }
96
+
97
+ # ── Stage: SFT replay (RLEF-aligned) ────────────────────────────────────────
98
+ build_sft() {
99
+ log "── SFT replay build ──"
100
+ [[ ! -f "$OUTCOMES" ]] && { log " no outcomes file β€” skip"; return 0; }
101
+ python3 - <<'PYEOF' "$OUTCOMES" "$WORK/sft.jsonl"
102
+ import json, sys
103
+ from pathlib import Path
104
+ src, dst = sys.argv[1], sys.argv[2]
105
+ n_in = n_out = 0
106
+ with open(dst, "w") as out:
107
+ for L in Path(src).read_text().splitlines():
108
+ n_in += 1
109
+ try: r = json.loads(L)
110
+ except Exception: continue
111
+ if not r.get("applied"): continue
112
+ if r.get("outcome") != "success": continue
113
+ # The model's diagnosis/spec/patch IS the response. The trigger +
114
+ # anomaly together form the prompt.
115
+ prompt = (
116
+ f"You are Surrogate-1 in {r.get('daemon','?')} mode.\n"
117
+ f"Trigger: {r.get('trigger','?')}\n"
118
+ f"Anomaly:\n```json\n{json.dumps(r.get('anomaly'), indent=2)}\n```\n"
119
+ f"Output a JSON action with diagnosis + patch."
120
+ )
121
+ resp = r.get("response")
122
+ if not isinstance(resp, dict): continue
123
+ out.write(json.dumps({
124
+ "prompt": prompt,
125
+ "response": json.dumps(resp, indent=2),
126
+ "source": "self-trace",
127
+ "ts": r.get("ts"),
128
+ "trigger": r.get("trigger"),
129
+ "lesson": r.get("lesson"),
130
+ }, ensure_ascii=False) + "\n")
131
+ n_out += 1
132
+ print(f" SFT pairs: {n_out} (read {n_in})")
133
+ PYEOF
134
+ local n; n=$(wc -l < "$WORK/sft.jsonl" | tr -d ' ')
135
+ log " β†’ $WORK/sft.jsonl ($n pairs)"
136
+ if (( n >= SFT_TRIGGER_N )); then
137
+ log " threshold met ($n β‰₯ $SFT_TRIGGER_N) β€” pushing + flagging trigger"
138
+ push_dataset "$SFT_REPO" "$WORK/sft.jsonl"
139
+ trigger_next_round "sft" "$n"
140
+ else
141
+ log " below trigger ($n < $SFT_TRIGGER_N) β€” accumulating"
142
+ fi
143
+ }
144
+
145
+ # ── Stage: KTO unpaired preferences ─────────────────────────────────────────
146
+ build_kto() {
147
+ log "── KTO unpaired build ──"
148
+ [[ ! -f "$OUTCOMES" ]] && { log " no outcomes file β€” skip"; return 0; }
149
+ python3 - <<'PYEOF' "$OUTCOMES" "$WORK/kto.jsonl"
150
+ import json, sys
151
+ from pathlib import Path
152
+ src, dst = sys.argv[1], sys.argv[2]
153
+ n = 0
154
+ with open(dst, "w") as out:
155
+ for L in Path(src).read_text().splitlines():
156
+ try: r = json.loads(L)
157
+ except Exception: continue
158
+ oc = r.get("outcome")
159
+ if oc not in ("success","rollback","error","queued","rejected"): continue
160
+ # KTO label: True = applied & success, False = anything else
161
+ label = bool(r.get("applied")) and (oc == "success")
162
+ prompt = (
163
+ f"trigger={r.get('trigger','?')} daemon={r.get('daemon','?')}\n"
164
+ f"anomaly={json.dumps(r.get('anomaly'))[:400]}"
165
+ )
166
+ resp = r.get("response")
167
+ if not isinstance(resp, dict): continue
168
+ out.write(json.dumps({
169
+ "prompt": prompt,
170
+ "completion": json.dumps(resp)[:2000],
171
+ "label": label,
172
+ "ts": r.get("ts"),
173
+ }, ensure_ascii=False) + "\n")
174
+ n += 1
175
+ print(f" KTO rows: {n}")
176
+ PYEOF
177
+ local n; n=$(wc -l < "$WORK/kto.jsonl" | tr -d ' ')
178
+ log " β†’ $WORK/kto.jsonl ($n rows)"
179
+ if (( n >= KTO_TRIGGER_N )); then
180
+ log " threshold met β€” pushing"
181
+ push_dataset "$KTO_REPO" "$WORK/kto.jsonl"
182
+ trigger_next_round "kto" "$n"
183
+ fi
184
+ }
185
+
186
+ # ── Stage: skill library ────────────────────────────────────────────────────
187
+ build_skills() {
188
+ log "── skill library build ──"
189
+ [[ ! -f "$OUTCOMES" ]] && { log " no outcomes file β€” skip"; return 0; }
190
+ python3 - <<'PYEOF' "$OUTCOMES" "$WORK/skills.jsonl"
191
+ import json, sys, collections
192
+ from pathlib import Path
193
+ src, dst = sys.argv[1], sys.argv[2]
194
+ # Group successful patches by trigger keyword to form a skill = (keyword, top-N successful patches)
195
+ groups = collections.defaultdict(list)
196
+ for L in Path(src).read_text().splitlines():
197
+ try: r = json.loads(L)
198
+ except Exception: continue
199
+ if not (r.get("applied") and r.get("outcome") == "success"): continue
200
+ resp = r.get("response")
201
+ if not isinstance(resp, dict): continue
202
+ trig = r.get("trigger","misc").split(":")[0]
203
+ groups[trig].append({
204
+ "patch": resp.get("patch",""),
205
+ "rollback": resp.get("rollback",""),
206
+ "test_plan": resp.get("test_plan",""),
207
+ "ts": r.get("ts"),
208
+ })
209
+ n = 0
210
+ with open(dst, "w") as out:
211
+ for trig, items in groups.items():
212
+ items.sort(key=lambda x: x.get("ts",""), reverse=True)
213
+ out.write(json.dumps({
214
+ "skill": trig,
215
+ "n_examples": len(items),
216
+ "examples": items[:5], # keep top 5 most-recent
217
+ }, ensure_ascii=False) + "\n")
218
+ n += 1
219
+ print(f" skills: {n}")
220
+ PYEOF
221
+ local n; n=$(wc -l < "$WORK/skills.jsonl" | tr -d ' ')
222
+ log " β†’ $WORK/skills.jsonl ($n skills)"
223
+ if (( n > 0 )); then
224
+ push_dataset "$SKILL_REPO" "$WORK/skills.jsonl"
225
+ fi
226
+ }
227
+
228
+ # ── Push to HF Hub via huggingface_hub Python API ───────────────────────────
229
+ push_dataset() {
230
+ local repo="$1" path="$2"
231
+ if [[ -z "${HF_TOKEN:-}" ]]; then
232
+ log " HF_TOKEN missing β€” saving locally only"
233
+ return 0
234
+ fi
235
+ python3 - <<PYEOF
236
+ import os
237
+ from huggingface_hub import HfApi, create_repo
238
+ api = HfApi(token=os.environ["HF_TOKEN"])
239
+ try:
240
+ create_repo("$repo", repo_type="dataset", exist_ok=True, private=False)
241
+ except Exception as e:
242
+ print(f" create_repo: {type(e).__name__}: {e}")
243
+ api.upload_file(
244
+ path_or_fileobj="$path",
245
+ path_in_repo="$(basename "$path")",
246
+ repo_id="$repo",
247
+ repo_type="dataset",
248
+ commit_message="self-improve: $(basename "$path") $(date -u +%Y%m%dT%H%MZ)",
249
+ )
250
+ print(f" pushed β†’ https://huggingface.co/datasets/$repo")
251
+ PYEOF
252
+ }
253
+
254
+ # ── Trigger next training round ─────────────────────────────────────────────
255
+ trigger_next_round() {
256
+ local stage="$1" n="$2"
257
+ log " TRIGGER next training round (stage=$stage n=$n)"
258
+ notify "$stage threshold reached ($n) β€” flagging next training round"
259
+ python3 "$HFB/outcome-log.py" --daemon manual --trigger "self-improve-trigger-$stage" \
260
+ --applied false --outcome queued \
261
+ --lesson "$stage threshold reached ($n) β€” V8 training queued" || true
262
+ # If Kaggle CLI ever returns to a working state, this is where we'd
263
+ # call `kaggle kernels push`. For now, write a flag file the user
264
+ # checks manually.
265
+ echo "$(date -u +%Y%m%dT%H%MZ) $stage n=$n" >> "$STATE/training-queue.log"
266
+ }
267
+
268
+ # ── Dispatcher ──────────────────────────────────────────────────────────────
269
+ case "$CMD" in
270
+ status) status ;;
271
+ sft) build_sft ;;
272
+ kto) build_kto ;;
273
+ skills) build_skills ;;
274
+ all)
275
+ status
276
+ build_skills
277
+ build_sft
278
+ build_kto
279
+ ;;
280
+ *) echo "usage: $0 {all|sft|kto|skills|status}" >&2; exit 2 ;;
281
+ esac
282
+
283
+ log "done"
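On the training side, kto.jsonl already matches the unpaired prompt/completion/label schema that TRL's KTO implementation expects. A hedged sketch, assuming a placeholder base model and a TRL version that ships KTOTrainer (older releases take tokenizer= instead of processing_class=; check your installed version's signature):

```python
# Feed the self-improve KTO rows into TRL's unpaired-preference trainer.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import KTOConfig, KTOTrainer

model_id = "axentx/surrogate-1-7B-base"   # placeholder model id
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# kto.jsonl rows: {"prompt": ..., "completion": ..., "label": true|false}
ds = load_dataset("json", data_files="kto.jsonl", split="train")

trainer = KTOTrainer(
    model=model,
    args=KTOConfig(output_dir="kto-out", per_device_train_batch_size=2),
    train_dataset=ds,
    processing_class=tokenizer,   # tokenizer= on older TRL versions
)
trainer.train()
```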
bin/v2/surrogate-call.py ADDED
@@ -0,0 +1,177 @@
1
+ #!/usr/bin/env python3
2
+ """Surrogate-1 β€” single-shot call to the ZeroGPU endpoint with strict JSON parse.
3
+
4
+ Used by autonomous-sre.sh + autonomous-release.sh to ask Surrogate-1 for
5
+ a structured diagnosis/spec/patch. Returns parsed JSON on stdout, exits 0
6
+ if the response is valid + matches the schema, else non-zero.
7
+
8
+ Usage:
9
+ surrogate-call.py \
10
+ --space surrogate1/surrogate-1-zero-gpu \
11
+ --prompt-file /tmp/prompt.md \
12
+ --schema diagnosis|spec|patch \
13
+ [--max-tokens 1024] [--temperature 0.2] \
14
+ [--retries 2] [--out /tmp/response.json]
15
+
16
+ Env:
17
+ HF_TOKEN (or HF_TOKEN_PRO / HF_TOKEN_PRO_WRITE) — required for private/queued Space
18
+ SURROGATE_TIMEOUT_SEC=120 β€” per-call timeout
19
+ SURROGATE_RETRY_BACKOFF_SEC=15 β€” sleep between retries
20
+ """
21
+ from __future__ import annotations
22
+
23
+ import argparse
24
+ import json
25
+ import os
26
+ import re
27
+ import sys
28
+ import time
29
+ from pathlib import Path
30
+ from urllib import request, error
31
+
32
+ TIMEOUT = int(os.environ.get("SURROGATE_TIMEOUT_SEC", "120"))
33
+ BACKOFF = int(os.environ.get("SURROGATE_RETRY_BACKOFF_SEC", "15"))
34
+
35
+ SCHEMAS = {
36
+ "diagnosis": {
37
+ "required": ["diagnosis", "fix_kind", "confidence"],
38
+ "fix_kind_enum": ["code", "iac", "shell", "sql", "none"],
39
+ "extras": ["patch", "target_file", "rollback", "test_plan"],
40
+ },
41
+ "spec": {
42
+ "required": ["title", "problem", "user_stories",
43
+ "acceptance_criteria", "impact", "confidence"],
44
+ "extras": ["competitors_observed", "out_of_scope", "rollout_plan"],
45
+ },
46
+ "patch": {
47
+ "required": ["target_file", "patch", "kind",
48
+ "test_plan", "rollback", "confidence"],
49
+ "extras": ["fix_kind", "diagnosis"],
50
+ },
51
+ }
52
+
53
+
54
+ def _hf_token() -> str | None:
55
+ return (os.environ.get("HF_TOKEN")
56
+ or os.environ.get("HF_TOKEN_PRO")
57
+ or os.environ.get("HF_TOKEN_PRO_WRITE"))
58
+
59
+
60
+ def _post_json(url: str, body: dict, token: str | None) -> dict:
61
+ headers = {"Content-Type": "application/json"}
62
+ if token:
63
+ headers["Authorization"] = f"Bearer {token}"
64
+ req = request.Request(url, data=json.dumps(body).encode(),
65
+ headers=headers, method="POST")
66
+ with request.urlopen(req, timeout=TIMEOUT) as resp:
67
+ return json.loads(resp.read().decode())
68
+
69
+
70
+ def _extract_json(text: str) -> dict | None:
71
+ # Try fenced ```json … ``` first, then loose {...} sweep
72
+ m = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", text, flags=re.S)
73
+ candidates = [m.group(1)] if m else []
74
+ # also try the longest balanced {..} substring
75
+ depth = 0; start = -1; longest = ""
76
+ for i, ch in enumerate(text):
77
+ if ch == "{":
78
+ if depth == 0:
79
+ start = i
80
+ depth += 1
81
+ elif ch == "}":
82
+ depth -= 1
83
+ if depth == 0 and start >= 0:
84
+ blob = text[start:i + 1]
85
+ if len(blob) > len(longest):
86
+ longest = blob
87
+ if longest:
88
+ candidates.append(longest)
89
+ for c in candidates:
90
+ try:
91
+ return json.loads(c)
92
+ except Exception:
93
+ continue
94
+ return None
95
+
96
+
97
+ def _validate(parsed: dict, schema: str) -> tuple[bool, str]:
98
+ spec = SCHEMAS.get(schema)
99
+ if not spec:
100
+ return False, f"unknown schema: {schema}"
101
+ missing = [k for k in spec["required"] if k not in parsed]
102
+ if missing:
103
+ return False, f"missing required keys: {missing}"
104
+ if schema == "diagnosis":
105
+ if parsed.get("fix_kind") not in spec["fix_kind_enum"]:
106
+ return False, f"fix_kind must be one of {spec['fix_kind_enum']}"
107
+ try:
108
+ c = float(parsed.get("confidence", -1))
109
+ if not (0.0 <= c <= 1.0):
110
+ return False, f"confidence out of [0,1]: {c}"
111
+ except Exception:
112
+ return False, "confidence not numeric"
113
+ return True, "ok"
114
+
115
+
116
+ def _call_gradio(space: str, prompt: str, max_tokens: int,
117
+ temperature: float) -> str:
118
+ # Most Surrogate ZeroGPU Spaces expose /run/predict or /api/predict.
119
+ # Try modern /api/predict first, fall back to /run/predict.
120
+ base = f"https://{space.replace('/', '-')}.hf.space"
121
+ body = {"data": [prompt, "", max_tokens, temperature]}
122
+ for path in ("/api/predict", "/run/predict"):
123
+ try:
124
+ r = _post_json(base + path, body, _hf_token())
125
+ if isinstance(r, dict) and "data" in r and r["data"]:
126
+ first = r["data"][0]
127
+ if isinstance(first, str):
128
+ return first
129
+ if isinstance(first, list) and first:
130
+ return str(first[0])
131
+ return json.dumps(r)
132
+ except error.HTTPError as e:
133
+ if e.code in (404, 405):
134
+ continue
135
+ raise
136
+ raise RuntimeError(f"no working endpoint on {base}")
137
+
138
+
139
+ def main() -> int:
140
+ p = argparse.ArgumentParser()
141
+ p.add_argument("--space", required=True,
142
+ help="HF Space repo, e.g. surrogate1/surrogate-1-zero-gpu")
143
+ p.add_argument("--prompt-file", required=True)
144
+ p.add_argument("--schema", required=True, choices=list(SCHEMAS.keys()))
145
+ p.add_argument("--max-tokens", type=int, default=1024)
146
+ p.add_argument("--temperature", type=float, default=0.2)
147
+ p.add_argument("--retries", type=int, default=2)
148
+ p.add_argument("--out", default=None)
149
+ args = p.parse_args()
150
+
151
+ prompt = Path(args.prompt_file).read_text()
152
+ last_err = ""
153
+ for attempt in range(args.retries + 1):
154
+ try:
155
+ raw = _call_gradio(args.space, prompt, args.max_tokens, args.temperature)
156
+ parsed = _extract_json(raw)
157
+ if parsed is None:
158
+ last_err = f"no JSON in response (preview: {raw[:200]})"
159
+ else:
160
+ ok, msg = _validate(parsed, args.schema)
161
+ if ok:
162
+ out = json.dumps(parsed, indent=2)
163
+ print(out)
164
+ if args.out:
165
+ Path(args.out).write_text(out)
166
+ return 0
167
+ last_err = f"schema validation failed: {msg}"
168
+ except Exception as e:
169
+ last_err = f"{type(e).__name__}: {e}"
170
+ if attempt < args.retries:
171
+ time.sleep(BACKOFF * (attempt + 1))
172
+ sys.stderr.write(f"surrogate-call failed: {last_err}\n")
173
+ return 2
174
+
175
+
176
+ if __name__ == "__main__":
177
+ sys.exit(main())
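A hedged sketch of driving this helper from Python rather than bash: on exit 0 stdout is the schema-validated JSON, so it can be parsed directly (Space name from the docstring; prompt path illustrative):

```python
# Call surrogate-call.py and consume the validated diagnosis JSON.
import json
import subprocess

proc = subprocess.run(
    ["python3", "bin/v2/surrogate-call.py",
     "--space", "surrogate1/surrogate-1-zero-gpu",
     "--prompt-file", "/tmp/prompt.md",      # illustrative prompt path
     "--schema", "diagnosis"],
    capture_output=True, text=True)
if proc.returncode == 0:
    diag = json.loads(proc.stdout)           # already schema-validated
    print(diag["fix_kind"], diag["confidence"])
else:
    print("call failed:", proc.stderr.strip())
```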
bin/v2/verifier-ensemble.py ADDED
@@ -0,0 +1,404 @@
1
+ #!/usr/bin/env python3
2
+ """Surrogate-1 β€” verifier ensemble (single source of truth for "safe to apply").
3
+
4
+ Used by autonomous-sre.sh + autonomous-release.sh BEFORE any action is
5
+ applied to the user's real systems. Returns a JSON verdict; the caller
6
+ applies only if verdict.ok == True.
7
+
8
+ Layers (each returns PASS / FAIL / SKIP):
9
+ 1. ast — Python AST (.py) / node --check (.js,.ts) / bash -n (.sh) / yaml+json parse (iac)
10
+ 2. lint — ruff (.py) / shellcheck (.sh) / hadolint (Dockerfile)
11
+ / cfn-lint (CF) / tflint (TF)
12
+ 3. typecheck β€” mypy / tsc / pyright if config present
13
+ 4. tests — pytest on the repo's tests/ dir if present
14
+ 5. policy β€” refuse-list of destructive patterns (rm -rf /, DROP DATABASE,
15
+ iam:* on Resource: "*", DELETE FROM <table> without WHERE…)
16
+ 6. security — semgrep --config=p/ci, gitleaks for secrets, cfn-guard if CFN rules set
17
+ 7. diff — non-empty, scoped: ≤ MAX_LINES_CHANGED lines, ≤ 8 files
18
+ 8. sandbox — shell scripts exec in throwaway docker (--network=none), bash -n fallback
19
+ 9. confidence β€” caller passes model logprob; threshold check
20
+
21
+ DECISION:
22
+ ALL non-SKIP must be PASS, AND at least MIN_VERIFIERS_RUN actually executed.
23
+ Any FAIL β†’ ok=False with reasons.
24
+
25
+ Usage:
26
+ verifier-ensemble.py \
27
+ --change /path/to/patch.diff \
28
+ --target /path/to/file/being/changed \
29
+ --kind iac|code|sql|shell \
30
+ --confidence 0.92 \
31
+ --out /tmp/verdict.json
32
+ """
33
+ from __future__ import annotations
34
+
35
+ import argparse
36
+ import json
37
+ import os
38
+ import re
39
+ import shlex
40
+ import subprocess
41
+ import sys
42
+ import tempfile
43
+ from dataclasses import dataclass, field, asdict
44
+ from pathlib import Path
45
+
46
+ MIN_VERIFIERS_RUN = int(os.environ.get("VERIFIER_MIN_RUN", "3"))
47
+ MAX_LINES_CHANGED = int(os.environ.get("VERIFIER_MAX_LINES", "300"))
48
+ CONFIDENCE_FLOOR = float(os.environ.get("VERIFIER_CONFIDENCE_FLOOR", "0.55"))
49
+
50
+ # Hard refuse list β€” patterns that auto-FAIL regardless of other checks.
51
+ # Each entry: (regex, reason). Sourced from research (autonomous-24x7.md
52
+ # Β§HardGuards) β€” 14+ canonical rules. NEVER auto-override these in code.
53
+ REFUSE_PATTERNS = [
54
+ # 1. Filesystem destruction
55
+ (r"\brm\s+-rf\s+/(?!tmp|var/tmp|home/[^/]+/\.surrogate)", "rm -rf on real fs root"),
56
+ (r"\bchmod\s+-R\s+777\s+/(?!tmp)", "chmod 777 outside /tmp"),
57
+ (r"\bchown\s+-R\s+\S+\s+/(?!tmp|home/)", "chown -R on system path"),
58
+ # 2. Database destruction
59
+ (r"\bDROP\s+(DATABASE|TABLE|SCHEMA)\b", "destructive SQL DDL"),
60
+ (r"\bDELETE\s+FROM\b(?![^;]*\bWHERE\b)", "DELETE without WHERE"),
61
+ (r"\bTRUNCATE\s+TABLE\b", "TRUNCATE TABLE"),
62
+ # 3. IaC destructive ops on prod
63
+ (r"\bterraform\s+destroy\b", "terraform destroy"),
64
+ (r"\bterraform\s+(apply|plan).*\bworkspace.*\bprd\b", "terraform on prd workspace"),
65
+ (r"\bcdk\s+destroy\b.*\b(prd|prod)\b", "cdk destroy on prod"),
66
+ # 4. Cloud destructive ops
67
+ (r"\baws\s+s3\s+rb\s+--force\b", "aws s3 rb --force"),
68
+ (r"\baws\s+ec2\s+terminate-instances\b(?!.*--dry-run)", "ec2 terminate w/o dry-run"),
69
+ (r"\baws\s+rds\s+delete-db-instance\b(?!.*--final-db-snapshot-identifier)",
70
+ "rds delete w/o final snapshot"),
71
+ (r"\baws\s+route53\s+change-resource-record-sets\b.*\bDELETE\b", "Route53 DELETE record"),
72
+ # 5. Kubernetes destructive ops
73
+ (r"\bkubectl\s+delete\s+ns\b", "kubectl delete namespace"),
74
+ (r"\bkubectl\s+delete\s+\S+\s+\S*prod\S*\b", "kubectl delete *prod*"),
75
+ (r"\bhelm\s+install\b.*\b(http://|registry\.\S+)\b(?!.*allowlist)",
76
+ "helm install from non-allowlist registry"),
77
+ # 6. Git/source destruction
78
+ (r"\bgit\s+push\s+(--force|--force-with-lease).*\b(main|master|prod)\b",
79
+ "force-push to main/prod"),
80
+ (r"\bgit\s+filter-(branch|repo)\b", "git history rewrite"),
81
+ # 7. IAM / auth weakening
82
+ (r'"Action"\s*:\s*"\*".*"Resource"\s*:\s*"\*"', "IAM Allow * on *"),
83
+ (r'"Effect"\s*:\s*"Allow".*"Principal"\s*:\s*"\*"', "IAM Allow Principal *"),
84
+ (r"\baws\s+iam\s+(delete-user|delete-role|update-assume-role-policy)\b.*\b(admin|root|prod)\b",
85
+ "IAM destructive op on privileged identity"),
86
+ (r"\baws\s+ec2\s+revoke-security-group-(ingress|egress)\b.*\bprod\b",
87
+ "revoke prod SG rule"),
88
+ # 8. Disk / network
89
+ (r"\bdd\s+if=/dev/(zero|random)\s+of=/dev/[shv]d", "raw disk overwrite"),
90
+ (r"\biptables\s+-F\b", "iptables flush"),
91
+ # 9. Untrusted execution
92
+ (r"\b(curl|wget)\b\s+\S+\s*\|\s*(sudo\s+)?(bash|sh|zsh|python\d?)\b",
93
+ "curl | sh from network"),
94
+ (r"\bnpx\s+\S+\b(?!.*--package-lock-only)", "npx of untrusted package"),
95
+ # 10. Secrets in patch (must never land)
96
+ (r"AKIA[0-9A-Z]{16}", "AWS access key in patch"),
97
+ (r"-----BEGIN\s+(RSA|OPENSSH|EC|DSA)\s+PRIVATE\s+KEY-----", "private key in patch"),
98
+ (r"\bsk-[A-Za-z0-9]{32,}", "OpenAI/Anthropic-style API key"),
99
+ (r"\bhf_[A-Za-z0-9]{34}\b", "HuggingFace token in patch"),
100
+ # 11. MFA / security degradation
101
+ (r"\baws\s+iam\s+deactivate-mfa-device\b", "MFA deactivation"),
102
+ (r'"MultiFactorAuthPresent"\s*:\s*\{\s*"Bool"\s*:\s*"false"', "IAM bypass MFA"),
103
+ # 12. Helm / supply-chain risk
104
+ (r"\bdocker\s+pull\s+\S+(?!.*@sha256:)", "docker pull without digest pin"),
105
+ ]
106
+
107
+ # Destructive-class actions require >=0.95 confidence (from research Β§HardGuards)
108
+ DESTRUCTIVE_KEYWORDS = (
109
+ "destroy", "delete", "drop", "truncate", "force-push", "rm -rf",
110
+ "terminate", "revoke", "deactivate-mfa", "filter-branch",
111
+ )
112
+ DESTRUCTIVE_CONFIDENCE_FLOOR = float(
113
+ os.environ.get("VERIFIER_DESTRUCTIVE_FLOOR", "0.95"))
114
+
115
+
116
+ def _is_destructive(change: str) -> bool:
117
+ low = change.lower()
118
+ return any(kw in low for kw in DESTRUCTIVE_KEYWORDS)
119
+
120
+
121
+ @dataclass
122
+ class CheckResult:
123
+ name: str
124
+ status: str # PASS / FAIL / SKIP
125
+ detail: str = ""
126
+
127
+ def passed(self) -> bool:
128
+ return self.status == "PASS"
129
+
130
+ def failed(self) -> bool:
131
+ return self.status == "FAIL"
132
+
133
+
134
+ @dataclass
135
+ class Verdict:
136
+ ok: bool
137
+ reasons: list[str] = field(default_factory=list)
138
+ checks: list[CheckResult] = field(default_factory=list)
139
+ n_pass: int = 0
140
+ n_fail: int = 0
141
+ n_skip: int = 0
142
+
143
+
144
+ def _run(cmd: list[str], timeout: int = 60, cwd: str | None = None) -> tuple[int, str, str]:
145
+ try:
146
+ p = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout, cwd=cwd)
147
+ return p.returncode, p.stdout, p.stderr
148
+ except subprocess.TimeoutExpired:
149
+ return 124, "", "timeout"
150
+ except FileNotFoundError:
151
+ return 127, "", f"binary not found: {cmd[0]}"
152
+
153
+
154
+ def _have(binary: str) -> bool:
155
+ return _run(["which", binary])[0] == 0
156
+
157
+
158
+ # ── Layer 1: AST parse ──────────────────────────────────────────────────────
159
+ def check_ast(target: Path, kind: str) -> CheckResult:
160
+ if not target.exists():
161
+ return CheckResult("ast", "SKIP", "target file does not exist yet")
162
+ if kind == "code" and target.suffix == ".py":
163
+ try:
164
+ import ast
165
+ ast.parse(target.read_text())
166
+ return CheckResult("ast", "PASS", "python AST parses")
167
+ except SyntaxError as e:
168
+ return CheckResult("ast", "FAIL", f"py syntax: {e}")
169
+ if kind == "code" and target.suffix in (".js", ".ts", ".tsx", ".jsx"):
170
+ if _have("node"):
171
+ rc, _, err = _run(["node", "--check", str(target)], timeout=15)
172
+ return CheckResult("ast", "PASS" if rc == 0 else "FAIL", err.strip()[:200] or "ok")
173
+ return CheckResult("ast", "SKIP", "node not installed")
174
+ if kind == "shell" or target.suffix == ".sh":
175
+ rc, _, err = _run(["bash", "-n", str(target)], timeout=15)
176
+ return CheckResult("ast", "PASS" if rc == 0 else "FAIL", err.strip()[:200] or "ok")
177
+ if kind == "iac" and target.suffix in (".yml", ".yaml", ".json"):
178
+ try:
179
+ txt = target.read_text()
180
+ if target.suffix == ".json":
181
+ json.loads(txt)
182
+ else:
183
+ import yaml # type: ignore
184
+ yaml.safe_load(txt)
185
+ return CheckResult("ast", "PASS", "yaml/json parses")
186
+ except Exception as e:
187
+ return CheckResult("ast", "FAIL", f"parse: {e}")
188
+ return CheckResult("ast", "SKIP", f"no AST parser for {target.suffix} (kind={kind})")
189
+
190
+
191
+ # ── Layer 2: lint ───────────────────────────────────────────────────────────
192
+ def check_lint(target: Path, kind: str) -> CheckResult:
193
+ if not target.exists():
194
+ return CheckResult("lint", "SKIP", "no file")
195
+ sx = target.suffix
196
+ if sx == ".py" and _have("ruff"):
197
+ rc, out, _ = _run(["ruff", "check", str(target), "--quiet"], timeout=30)
198
+ return CheckResult("lint", "PASS" if rc == 0 else "FAIL", out.strip()[:300] or "clean")
199
+ if sx == ".sh" and _have("shellcheck"):
200
+ rc, out, _ = _run(["shellcheck", "-S", "warning", str(target)], timeout=30)
201
+ return CheckResult("lint", "PASS" if rc == 0 else "FAIL", out.strip()[:300] or "clean")
202
+ if target.name in ("Dockerfile",) and _have("hadolint"):
203
+ rc, out, _ = _run(["hadolint", "--no-fail", str(target)], timeout=30)
204
+ return CheckResult("lint", "PASS" if rc == 0 else "FAIL", out.strip()[:300])
205
+ if kind == "iac" and "cf" in str(target).lower() and _have("cfn-lint"):
206
+ rc, out, _ = _run(["cfn-lint", str(target)], timeout=60)
207
+ return CheckResult("lint", "PASS" if rc == 0 else "FAIL", out.strip()[:300] or "clean")
208
+ if kind == "iac" and target.suffix == ".tf" and _have("tflint"):
209
+ rc, out, _ = _run(["tflint", str(target)], timeout=60)
210
+ return CheckResult("lint", "PASS" if rc == 0 else "FAIL", out.strip()[:300] or "clean")
211
+ return CheckResult("lint", "SKIP", "no linter for file type or binary missing")
212
+
213
+
214
+ # ── Layer 3: typecheck ──────────────────────────────────────────────────────
215
+ def check_typecheck(target: Path, kind: str) -> CheckResult:
216
+ if not target.exists() or kind != "code":
217
+ return CheckResult("typecheck", "SKIP", "n/a")
218
+ if target.suffix == ".py" and _have("mypy"):
219
+ rc, out, _ = _run(["mypy", "--ignore-missing-imports", "--no-error-summary",
220
+ str(target)], timeout=45)
221
+ return CheckResult("typecheck", "PASS" if rc == 0 else "FAIL", out.strip()[:300] or "ok")
222
+ if target.suffix in (".ts", ".tsx") and _have("tsc"):
223
+ rc, out, _ = _run(["tsc", "--noEmit", "--allowJs", str(target)], timeout=60)
224
+ return CheckResult("typecheck", "PASS" if rc == 0 else "FAIL", out.strip()[:300] or "ok")
225
+ return CheckResult("typecheck", "SKIP", "no typechecker available")
226
+
227
+
228
+ # ── Layer 4: tests ──────────────────────────────────────────────────────────
229
+ def check_tests(target: Path, kind: str) -> CheckResult:
230
+ repo = target.parent
231
+ while repo != repo.parent and not (repo / ".git").exists():
232
+ repo = repo.parent
233
+ if not (repo / ".git").exists():
234
+ return CheckResult("tests", "SKIP", "not a git repo")
235
+ test_dir = next((repo / d for d in ("tests", "test", "__tests__") if (repo / d).is_dir()), None)
236
+ if test_dir is None:
237
+ return CheckResult("tests", "SKIP", "no tests/ dir")
238
+ if _have("pytest"):
239
+ rc, out, _ = _run(["pytest", "-x", "--tb=line", "-q", str(test_dir)],
240
+ timeout=180, cwd=str(repo))
241
+ return CheckResult("tests", "PASS" if rc == 0 else "FAIL",
242
+ (out.strip().splitlines() or ["no output"])[-1][:200])
243
+ return CheckResult("tests", "SKIP", "pytest not installed")
244
+
245
+
246
+ # ── Layer 5: policy (refuse-list) ───────────────────────────────────────────
247
+ def check_policy(change: str) -> CheckResult:
248
+ hits = []
249
+ for pat, reason in REFUSE_PATTERNS:
250
+ if re.search(pat, change, flags=re.IGNORECASE):
251
+ hits.append(reason)
252
+ if hits:
253
+ return CheckResult("policy", "FAIL", f"refused: {'; '.join(hits)}")
254
+ return CheckResult("policy", "PASS", "no refuse-list patterns matched")
255
+
256
+
257
+ # ── Layer 6: security ───────────────────────────────────────────────────────
258
+ def check_security(target: Path, change: str) -> CheckResult:
259
+ detail = []
260
+ # secrets β€” gitleaks if available, else regex fallback
261
+ if _have("gitleaks"):
262
+ with tempfile.NamedTemporaryFile("w", suffix=".diff", delete=False) as f:
263
+ f.write(change); patch = f.name
264
+ rc, out, _ = _run(["gitleaks", "detect", "--no-git", "--source", patch,
265
+ "--report-format", "json"], timeout=30)
266
+ if rc != 0 and out.strip() and out.strip() != "[]":
267
+ detail.append(f"gitleaks hit: {out[:200]}")
268
+ else:
269
+ for pat in (r"AKIA[0-9A-Z]{16}", r"AIza[0-9A-Za-z\-_]{35}",
270
+ r"sk-[a-zA-Z0-9]{32,}", r"hf_[a-zA-Z0-9]{34}"):
271
+ if re.search(pat, change):
272
+ detail.append(f"secret pattern: {pat[:20]}…")
273
+ # semgrep
274
+ if _have("semgrep") and target.exists():
275
+ rc, out, _ = _run(["semgrep", "--config=p/ci", "--quiet", "--error",
276
+ "--timeout", "30", str(target)], timeout=90)
277
+ if rc not in (0, 1): # 1 = findings (recorded as failures below); >1 = tool error
278
+ detail.append(f"semgrep err: {out[:120]}")
279
+ elif rc == 1:
280
+ detail.append(f"semgrep findings: {out.strip().splitlines()[-1][:150]}")
281
+ # iac scanners
282
+ if "cf" in str(target).lower() and _have("cfn-guard"):
283
+ rules = os.environ.get("CFN_GUARD_RULES", "")
284
+ if rules:
285
+ rc, out, _ = _run(["cfn-guard", "validate", "-d", str(target), "-r", rules],
286
+ timeout=60)
287
+ if rc != 0:
288
+ detail.append(f"cfn-guard: {out[:200]}")
289
+ if not detail:
290
+ return CheckResult("security", "PASS", "no findings")
291
+ return CheckResult("security", "FAIL", " | ".join(detail))
292
+
293
+
294
+ # ── Layer 7: diff sanity ────────────────────────────────────────────────────
295
+ def check_diff(change: str) -> CheckResult:
296
+ lines = change.splitlines()
297
+ add = sum(1 for L in lines if L.startswith("+") and not L.startswith("+++"))
298
+ rem = sum(1 for L in lines if L.startswith("-") and not L.startswith("---"))
299
+ total = add + rem
300
+ if total == 0:
301
+ return CheckResult("diff", "FAIL", "empty diff")
302
+ if total > MAX_LINES_CHANGED:
303
+ return CheckResult("diff", "FAIL",
304
+ f"{total} lines changed > limit {MAX_LINES_CHANGED}")
305
+ files_changed = sum(1 for L in lines if L.startswith("+++ b/"))
306
+ if files_changed > 8:
307
+ return CheckResult("diff", "FAIL", f"{files_changed} files in one change > 8")
308
+ return CheckResult("diff", "PASS", f"+{add}/-{rem} lines, {files_changed} files")
309
+
310
+
311
+ # ── Layer 8: sandbox exec (best-effort) ─────────────────────────────────────
312
+ def check_sandbox(target: Path, kind: str) -> CheckResult:
313
+ if kind != "shell" or target.suffix != ".sh" or not target.exists():
314
+ return CheckResult("sandbox", "SKIP", "not a shell script or no target")
315
+ if not _have("docker"):
316
+ # Fall back to bash subshell with restricted env, no network
317
+ rc, out, err = _run(["env", "-i", "PATH=/usr/bin:/bin",
318
+ "bash", "-c", f"set -e; bash -n {shlex.quote(str(target))}"],
319
+ timeout=10)
320
+ return CheckResult("sandbox", "PASS" if rc == 0 else "FAIL",
321
+ (err or out).strip()[:200] or "ok-no-exec")
322
+ # docker β€” run in network=none, read-only, dropped caps
323
+ rc, out, err = _run([
324
+ "docker", "run", "--rm", "--network=none", "--read-only",
325
+ "--cap-drop=ALL", "--memory=256m", "--cpus=0.5",
326
+ "-v", f"{target}:/script.sh:ro",
327
+ "alpine:3.20", "sh", "-c", "bash /script.sh --dry-run --help 2>&1 | head -20",
328
+ ], timeout=30)
329
+ return CheckResult("sandbox", "PASS" if rc == 0 else "FAIL",
330
+ (out or err).strip()[:200] or "ok")
331
+
332
+
333
+ # ── Layer 9: confidence (with destructive-class escalation) ────────────────
334
+ def check_confidence(conf: float | None, change: str) -> CheckResult:
335
+ if conf is None:
336
+ return CheckResult("confidence", "SKIP", "no confidence supplied")
337
+ floor = CONFIDENCE_FLOOR
338
+ if _is_destructive(change):
339
+ floor = max(floor, DESTRUCTIVE_CONFIDENCE_FLOOR)
340
+ suffix = " (destructive-class)"
341
+ else:
342
+ suffix = ""
343
+ if conf < floor:
344
+ return CheckResult("confidence", "FAIL",
345
+ f"{conf:.2f} below floor {floor}{suffix}")
346
+ return CheckResult("confidence", "PASS", f"{conf:.2f} β‰₯ {floor}{suffix}")
347
+
348
+
349
+ # ── Orchestrator ────────────────────────────────────────────────────────────
350
+ def verify(change: str, target: Path, kind: str, confidence: float | None) -> Verdict:
351
+ checks = [
352
+ check_diff(change), # 7
353
+ check_policy(change), # 5 β€” fail-fast hard
354
+ check_ast(target, kind), # 1
355
+ check_lint(target, kind), # 2
356
+ check_typecheck(target, kind), # 3
357
+ check_tests(target, kind), # 4
358
+ check_security(target, change), # 6
359
+ check_sandbox(target, kind), # 8
360
+ check_confidence(confidence, change), # 9 (with destructive escalation)
361
+ ]
362
+ n_pass = sum(c.passed() for c in checks)
363
+ n_fail = sum(c.failed() for c in checks)
364
+ n_skip = sum(c.status == "SKIP" for c in checks)
365
+ reasons = [f"{c.name}: {c.detail}" for c in checks if c.failed()]
366
+ n_run = n_pass + n_fail
367
+ ok = (n_fail == 0) and (n_run >= MIN_VERIFIERS_RUN)
368
+ if not ok and n_run < MIN_VERIFIERS_RUN:
369
+ reasons.append(f"only {n_run} verifiers ran (min {MIN_VERIFIERS_RUN}) β€” install missing tools")
370
+ return Verdict(ok=ok, reasons=reasons, checks=checks,
371
+ n_pass=n_pass, n_fail=n_fail, n_skip=n_skip)
372
+
373
+
374
+ def main() -> int:
375
+ p = argparse.ArgumentParser()
376
+ p.add_argument("--change", required=True,
377
+ help="path to unified-diff or raw patch text")
378
+ p.add_argument("--target", required=True,
379
+ help="primary file the change applies to")
380
+ p.add_argument("--kind", required=True, choices=["code", "iac", "sql", "shell"])
381
+ p.add_argument("--confidence", type=float, default=None,
382
+ help="model logprob-derived confidence in [0,1]")
383
+ p.add_argument("--out", default=None, help="write verdict JSON to this path")
384
+ args = p.parse_args()
385
+
386
+ change_path = Path(args.change)
387
+ change_txt = change_path.read_text() if change_path.exists() else args.change
388
+ verdict = verify(change_txt, Path(args.target), args.kind, args.confidence)
389
+ j = json.dumps({
390
+ "ok": verdict.ok,
391
+ "reasons": verdict.reasons,
392
+ "n_pass": verdict.n_pass,
393
+ "n_fail": verdict.n_fail,
394
+ "n_skip": verdict.n_skip,
395
+ "checks": [asdict(c) for c in verdict.checks],
396
+ }, indent=2)
397
+ print(j)
398
+ if args.out:
399
+ Path(args.out).write_text(j)
400
+ return 0 if verdict.ok else 1
401
+
402
+
403
+ if __name__ == "__main__":
404
+ sys.exit(main())
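A hedged end-to-end smoke test of the gate (illustrative paths): a one-line patch containing terraform destroy should trip both the policy refuse-list and, at confidence 0.90, the 0.95 destructive-class floor, so the verdict must come back ok=false with exit 1:

```python
# Exercise the policy + confidence layers through the CLI.
import json
import pathlib
import subprocess

patch = pathlib.Path("/tmp/bad.diff")
patch.write_text("+terraform destroy -auto-approve\n")
proc = subprocess.run(
    ["python3", "bin/v2/verifier-ensemble.py",
     "--change", str(patch), "--target", "/tmp/main.tf",
     "--kind", "iac", "--confidence", "0.90"],
    capture_output=True, text=True)
verdict = json.loads(proc.stdout)   # verdict JSON printed even on failure
assert not verdict["ok"]
print(verdict["reasons"])           # expect policy + confidence failures
```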
bin/v2/watchdog.sh ADDED
@@ -0,0 +1,195 @@
1
+ #!/usr/bin/env bash
2
+ # Surrogate-1 β€” watchdog (research Β§autonomous-24x7 pattern 6).
3
+ #
4
+ # Runs as a SEPARATE process from autonomous-sre / autonomous-release. Its
5
+ # only job is to observe outcomes.jsonl + the daemon processes and KILL
6
+ # them if it detects:
7
+ # - Loop / runaway (same trigger fires N times in M minutes)
8
+ # - Audit-gap (daemon logs an "applied" outcome but verifier was missing)
9
+ # - Failure cascade (β‰₯X consecutive rollback/error outcomes)
10
+ # - Disk fill (state dir > $STATE_GB_LIMIT GB)
11
+ # - Outcome rate spike (β‰₯X outcomes in 1 min β€” likely runaway)
12
+ #
13
+ # Watchdog must NEVER apply patches itself (no Surrogate calls, no patch
14
+ # tool). It only observes and kills. Restart of daemons is a human
15
+ # decision after reading the kill reason.
16
+ #
17
+ # Usage (run on a machine separate from the daemons in the hardened setup;
18
+ # for now we run it as a sibling process):
19
+ # nohup bash bin/v2/watchdog.sh \
20
+ # > $HOME/.surrogate/logs/watchdog.log 2>&1 &
21
+ set -uo pipefail
22
+ [[ -f "$HOME/.hermes/.env" ]] && { set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a; }
23
+
24
+ STATE="$HOME/.surrogate/state"
25
+ OUTCOMES="$STATE/outcomes.jsonl"
26
+ LOG="$HOME/.surrogate/logs/watchdog.log"
27
+ KILLED="$STATE/watchdog-killed"
28
+ mkdir -p "$STATE" "$(dirname "$LOG")"
29
+
30
+ INTERVAL_SEC="${WD_INTERVAL_SEC:-60}" # check every minute
31
+ LOOP_THRESHOLD_N="${WD_LOOP_N:-5}" # same trigger β‰₯5Γ—
32
+ LOOP_WINDOW_MIN="${WD_LOOP_WIN_MIN:-15}" # in 15 min
33
+ CASCADE_THRESHOLD="${WD_CASCADE_N:-5}" # β‰₯5 consecutive failures
34
+ RATE_SPIKE_PER_MIN="${WD_RATE_SPIKE:-30}" # β‰₯30 outcomes/min
35
+ STATE_GB_LIMIT="${WD_STATE_GB:-5}"
36
+ DAEMONS=(
37
+ "autonomous-sre.sh"
38
+ "autonomous-release.sh"
39
+ "auto-swap-and-bench.sh"
40
+ )
41
+
42
+ log() { echo "[$(date '+%Y-%m-%dT%H:%M:%S')] $*" | tee -a "$LOG"; }
43
+ notify() {
44
+ [[ -z "${DISCORD_WEBHOOK:-}" ]] && return
45
+ curl -s -X POST -H "Content-Type: application/json" \
46
+ -d "{\"content\":\"🚨 watchdog: $1\"}" \
47
+ "$DISCORD_WEBHOOK" >/dev/null 2>&1 || true
48
+ }
49
+
50
+ kill_daemons() {
51
+ local reason="$1"
52
+ log "═══ KILL: $reason ═══"
53
+ notify "KILL β€” $reason"
54
+ : > "$KILLED"; date -u +%Y-%m-%dT%H:%M:%SZ >> "$KILLED"
55
+ echo "$reason" >> "$KILLED"
56
+ for d in "${DAEMONS[@]}"; do
57
+ if pgrep -f "$d" >/dev/null; then
58
+ log " pkill -f $d"
59
+ pkill -f "$d" || true
60
+ fi
61
+ done
62
+ sleep 5
63
+ for d in "${DAEMONS[@]}"; do
64
+ if pgrep -f "$d" >/dev/null; then
65
+ log " pkill -9 -f $d (still alive)"
66
+ pkill -9 -f "$d" || true
67
+ fi
68
+ done
69
+ }
70
+
71
+ # Detect: same trigger fires N times in M minutes
72
+ check_loop() {
73
+ [[ ! -f "$OUTCOMES" ]] && return 0
74
+ python3 - <<PYEOF
75
+ import json, datetime as dt, collections, sys
76
+ cutoff = dt.datetime.now(dt.timezone.utc) - dt.timedelta(minutes=$LOOP_WINDOW_MIN)
77
+ recent = collections.Counter()
78
+ for L in open("$OUTCOMES"):
79
+ try: r = json.loads(L)
80
+ except: continue
81
+ try:
82
+ ts = dt.datetime.strptime(r["ts"], "%Y-%m-%dT%H:%M:%SZ")
83
+ except: continue
84
+ if ts < cutoff: continue
85
+ recent[r.get("trigger","?")] += 1
86
+ for trig, n in recent.items():
87
+ if n >= $LOOP_THRESHOLD_N:
88
+ sys.exit(11) # loop detected
89
+ sys.exit(0)
90
+ PYEOF
91
+ return $?
92
+ }
93
+
94
+ # Detect: β‰₯X consecutive non-success outcomes
95
+ check_cascade() {
96
+ [[ ! -f "$OUTCOMES" ]] && return 0
97
+ python3 - <<PYEOF
98
+ import json, sys
99
+ streak = 0
100
+ recent = []
101
+ for L in open("$OUTCOMES"):
102
+ try: r = json.loads(L)
103
+ except Exception: continue
104
+ recent.append(r)
105
+ recent = recent[-$CASCADE_THRESHOLD:]
106
+ if len(recent) < $CASCADE_THRESHOLD:
107
+ sys.exit(0)
108
+ if all(r.get("outcome") in ("rollback","error") for r in recent):
109
+ sys.exit(12)
110
+ sys.exit(0)
111
+ PYEOF
112
+ return $?
113
+ }
114
+
115
+ # Detect: outcome rate spike (>X in last minute)
116
+ check_rate_spike() {
117
+ [[ ! -f "$OUTCOMES" ]] && return 0
118
+ python3 - <<PYEOF
119
+ import json, datetime as dt, sys
120
+ cutoff = dt.datetime.now(dt.timezone.utc) - dt.timedelta(minutes=1)
121
+ n = 0
122
+ for L in open("$OUTCOMES"):
123
+ try: r = json.loads(L)
124
+ except: continue
125
+ try:
126
+ ts = dt.datetime.strptime(r["ts"], "%Y-%m-%dT%H:%M:%SZ")
127
+ except: continue
128
+ if ts >= cutoff: n += 1
129
+ if n >= $RATE_SPIKE_PER_MIN:
130
+ sys.exit(13)
131
+ sys.exit(0)
132
+ PYEOF
133
+ return $?
134
+ }
135
+
136
+ # Detect: applied without a verdict (audit gap)
137
+ check_audit_gap() {
138
+ [[ ! -f "$OUTCOMES" ]] && return 0
139
+ python3 - <<'PYEOF'
140
+ import json, sys, os
141
+ gaps = 0
142
+ with open(os.environ["OUTCOMES"]) as f:
143
+ for L in f.readlines()[-50:]:
144
+ try: r = json.loads(L)
145
+ except: continue
146
+ if r.get("applied") and not r.get("verdict"):
147
+ gaps += 1
148
+ if gaps >= 3:
149
+ sys.exit(14)
150
+ sys.exit(0)
151
+ PYEOF
152
+ return $?
153
+ }
154
+ export OUTCOMES  # the quoted heredoc above reads this via os.environ
155
+
156
+ # Detect: state dir disk fill
157
+ check_disk() {
158
+ local kb gb
159
+ kb=$(du -sk "$STATE" 2>/dev/null | awk '{print $1}')
160
+ gb=$(( kb / 1048576 ))
161
+ if (( gb > STATE_GB_LIMIT )); then
162
+ log "state dir = ${gb}GB > limit ${STATE_GB_LIMIT}GB"
163
+ return 15
164
+ fi
165
+ return 0
166
+ }
167
+
168
+ log "═══ watchdog starting (interval=${INTERVAL_SEC}s) ═══"
169
+ notify "watchdog online"
170
+
171
+ while true; do
172
+ if [[ -f "$KILLED" ]]; then
173
+ log "killed marker present β€” staying dormant. Remove $KILLED to re-arm."
174
+ sleep "$INTERVAL_SEC"
175
+ continue
176
+ fi
177
+
178
+ rc=0
179
+ check_loop || rc=$?
180
+ [[ $rc -eq 11 ]] && { kill_daemons "LOOP detected (β‰₯$LOOP_THRESHOLD_N same trigger in ${LOOP_WINDOW_MIN}m)"; continue; }
181
+
182
+ check_cascade || rc=$?
183
+ [[ $rc -eq 12 ]] && { kill_daemons "CASCADE: $CASCADE_THRESHOLD consecutive rollback/error"; continue; }
184
+
185
+ check_rate_spike || rc=$?
186
+ [[ $rc -eq 13 ]] && { kill_daemons "RATE SPIKE: β‰₯$RATE_SPIKE_PER_MIN outcomes in 60s"; continue; }
187
+
188
+ check_audit_gap || rc=$?
189
+ [[ $rc -eq 14 ]] && { kill_daemons "AUDIT GAP: β‰₯3 applied actions without verdict"; continue; }
190
+
191
+ check_disk || rc=$?
192
+ [[ $rc -eq 15 ]] && { kill_daemons "DISK: $STATE >${STATE_GB_LIMIT}GB"; continue; }
193
+
194
+ sleep "$INTERVAL_SEC"
195
+ done