Ashira Pitchayapakayakul committed on
Commit
4e166c6
·
1 Parent(s): dddf626

feat(v8+autonomy): research-driven trainer + 4 daemons + 9-layer safety gate


Synthesised from 4 parallel research streams (~1.6k lines of dense notes
in knowledge/trends-2026/) into one shippable change.

V8 trainer (kaggle-trainer.sh) — 5 research-grounded additions (knob
examples below):
• PiSSA SVD init (replaces LoftQ default; LoftQ kept as fallback via
  SUR_LORA_INIT=loftq) — Meng '24, +1-3pp on code benchmarks.
• LoRA+ optimizer with lr_B = 16·lr_A — Hayou '24, free +1-2pp,
  via peft.optimizers.create_loraplus_optimizer with manual-split
  fallback for older peft.
• V8 dataset blend via merge_external() — ToolACE 1.5×, Multi-IaC-Eval
  2×, xLAM-fn-call-60k 1×, ITBench-Trajectories 2×, Code-Feedback 1×.
  Each take/weight env-tunable; format-tolerant via extract_pair().
• GRPO Phase-2 scaffold (RUN_GRPO=1) — DeepSeekMath/RLVR-Code, post-SFT
  booster with execution-pass reward function. Disabled by default
  (needs TRL ≥0.12 + ≥30GB VRAM headroom).
• Hub bumped: axentx/surrogate-1-7B-v1.2-research.
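
Knob examples (illustrative values, not new defaults; assumes the knobs
are exported in the environment the trainer's embedded script reads):

    # A/B against the old init and skip one blend source for a run
    SUR_LORA_INIT=loftq TAKE_MULTIIAC=0 bash bin/kaggle-trainer.sh

    # opt in to the Phase-2 booster (TRL ≥0.12 + ≥30GB VRAM headroom)
    RUN_GRPO=1 SUR_LORA_PLUS_RATIO=16 bash bin/kaggle-trainer.sh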

Autonomous daemons (4) — all share verifier-ensemble.py + outcome-log.py
(sample outcome record below):
• autonomous-sre.sh — 5-min sweep: HF Space stages, dataset staleness,
  ZeroGPU smoke, GH Action failure rate, outcome-log self-health. On
  anomaly: build prompt → call Surrogate → idempotency dedupe → 9-layer
  verifier → apply OR queue. Whitelisted scope: only systems Surrogate
  owns (no AWS/prod).
• autonomous-release.sh — hourly recon: HN + GH-trending + ProductHunt;
  cluster by owner-relevant keywords, build SDD spec, generate 3 patch
  candidates with CISC self-consistency voting (research §autonomous-24x7
  pattern 1), pick best by verifier+confidence, open draft PR via gh.
• self-improve.sh — daily/weekly flywheel: outcomes.jsonl → SFT replay
  (success-only, RLEF-aligned), KTO unpaired (every label, lossless
  on logs), skill library (verified procedures by trigger). Pushes to
  axentx/surrogate-1-{self-traces,pref-kto,skills}; flags next training
  when SFT≥200 or KTO≥500.
• watchdog.sh — independent observer with kill-switch. Detects loops
  (≥5 same trigger in 15m), failure cascades (≥5 consecutive non-successes),
  rate spikes (≥30/min), audit gaps (applied without verdict), disk
  fill. Never calls Surrogate, never applies; only kills + records.

Safety gate (verifier-ensemble.py) — single source of truth, 9 layers
(invocation sketch below):
ast / lint (ruff/shellcheck/cfn-lint/tflint) / typecheck / tests /
policy (14-rule HardGuard list — terraform destroy, kubectl delete ns
prod-*, IAM Allow*:*, ec2 terminate w/o dry-run, rds delete w/o final
snapshot, helm install w/o digest pin, AKIA/private-key/sk-/hf_ leaks,
MFA bypass, force-push to main, etc) / security (gitleaks+semgrep+
cfn-guard) / diff sanity (≤300 lines, ≤8 files) / sandbox (docker
--network=none --read-only --cap-drop=ALL) / confidence (≥0.95 floor
for destructive-class actions). All non-SKIP layers must PASS, and ≥3
verifiers must run.
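
Invocation sketch, as the daemons drive it (flags per autonomous-sre.sh;
the verdict fields listed are the ones the callers read back):

    python3 bin/v2/verifier-ensemble.py \
      --change work/patch.txt --target path/to/target.py \
      --kind code --confidence 0.97 --out work/verdict.json
    # verdict.json carries at least {"ok": bool, "n_pass": int, "reasons": [...]}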

Helpers: surrogate-call.py (strict-JSON LLM call w/ retries + schema
validation for diagnosis|spec|patch), outcome-log.py (append-only JSONL),
idempotency.py (sha256(plan) ledger w/ TTL — prevents replay storms when
the same anomaly fires twice; usage below).
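
Idempotency usage, straight from the tool's docstring (check exits 0 when
the key was seen within the TTL, i.e. skip; 1 when new, i.e. proceed):

    python3 bin/v2/idempotency.py check --plan work/patch.txt --ttl-hours 4
    # ...apply the verified fix, then:
    python3 bin/v2/idempotency.py record --plan work/patch.txt --daemon sre --outcome applied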

Bench (bench-v1-vs-v15.sh): added 4th model (v1.2-research) + 2 new evals
(Multi-IaC-Eval CFN/TF/CDK pass-rate, ITBench-lite K8s SRE scenarios).
Now 4-way × 9 evals; summary JSON shape below.
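
The per-model block in the summary JSON gains the two new keys (scores
elided here):

    "v1.2-research": {
      ...
      "axentx_eval_50": "...",
      "multi_iac_eval": "...",
      "itbench_lite": "..."
    }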

Architecture map: knowledge/surrogate-1-autonomous-arch.md — single
on-ramp doc with all components, file paths, run/disarm commands, the
14 HardGuards, and the V9 stretch ladder.

V7 train.py at ~/Desktop/surrogate-1-train-v7-7B-extended-plus.py is now
superseded by ~/Desktop/surrogate-1-train-v8-research.py. User uploads
the V8 file via Kaggle UI Replace File → Save Version when ready.

bin/kaggle-trainer.sh CHANGED
@@ -206,8 +206,8 @@ EPOCHS = float(os.environ.get("EPOCHS", "1"))
206
  _default_hub = {
207
  32.0: "axentx/surrogate-1-coder-32B-v1.5",
208
  14.0: "axentx/surrogate-1-coder-14B-v1.5-mid",
209
- 7.0: "axentx/surrogate-1-7B-v1.1-extended", # ← T4×2 validation target
210
- }.get(_auto_size, "axentx/surrogate-1-7B-v1.1-extended")
211
  HUB_ID = os.environ.get("HUB_MODEL_ID", _default_hub)
212
  # seq_len auto-shrinks for smaller hardware budget
213
  _default_seq = {32.0: 2048, 14.0: 4096, 7.0: 8192}.get(_auto_size, 2048)
@@ -339,6 +339,44 @@ try:
339
  except Exception as e:
340
  print(f" ✗ Magpie skip (repo not yet published): {type(e).__name__}: {str(e)[:80]}")
341

342
  raw = Dataset.from_list(rows)
343
  # (Active-learning teachable filter applied AFTER model load — see below.
344
  # Filtering needs the 4-bit base model to score perplexity, which doesn't
@@ -445,21 +483,28 @@ lora_kwargs = dict(
445
  use_dora=True, # R2: DoRA
446
  task_type="CAUSAL_LM",
447
  )
448
- # RSLoRA + LoftQ require recent peft versions — fall back gracefully
449
  try:
450
  from peft import LoraConfig as _Probe
451
  import inspect
452
  _sig = inspect.signature(_Probe).parameters
453
  if "use_rslora" in _sig: lora_kwargs["use_rslora"] = True
454
  if "init_lora_weights" in _sig:
455
- try:
456
- from peft import LoftQConfig
457
- lora_kwargs["init_lora_weights"] = "loftq"
458
- lora_kwargs["loftq_config"] = LoftQConfig(loftq_bits=4, loftq_iter=5)
459
- except Exception:
460
- pass
461
- except Exception:
462
- pass
463
  print(f" LoRA config: r={LORA_R}, DoRA={lora_kwargs.get('use_dora')}, "
464
  f"RSLoRA={lora_kwargs.get('use_rslora', False)}, "
465
  f"init={lora_kwargs.get('init_lora_weights', 'gaussian')}, "
@@ -469,6 +514,44 @@ lora = LoraConfig(**lora_kwargs)
469
  model = get_peft_model(model, lora)
470
  model.print_trainable_parameters()
471

472
  # ── Format chat template (system + user + assistant) ────────────────────────
473
  def fmt(ex):
474
  msgs = [
@@ -522,12 +605,17 @@ sft_cfg = SFTConfig(
522
  report_to="none",
523
  )
524
 
525
- trainer = SFTTrainer(
526
  model=model,
527
  args=sft_cfg,
528
  train_dataset=raw,
529
  tokenizer=tok,
530
  )
531
 
532
  print()
533
  print("━━━ training start ━━━")
@@ -536,10 +624,65 @@ print("━━━ training done ━━━")
536
 
537
  # Final push (in case last save_steps didn't trigger)
538
  trainer.push_to_hub(commit_message=(
539
- f"Surrogate-1 v1.5 SFT — base={BASE.split('/')[-1]}, "
540
- f"r=32+DoRA, NEFTune α=5, seq={SEQ_LEN}, "
541
  f"{len(rows):,} samples Γ— {EPOCHS} epochs (Kaggle T4Γ—2)"))
542
  print("✅ pushed to", HUB_ID)
543
  PYEOF
544
 
545
  # ── Push notebook to Kaggle (creates if not exists, updates if exists) ─────
 
206
  _default_hub = {
207
  32.0: "axentx/surrogate-1-coder-32B-v1.5",
208
  14.0: "axentx/surrogate-1-coder-14B-v1.5-mid",
209
+ 7.0: "axentx/surrogate-1-7B-v1.2-research", # ← V8: research-driven stack
210
+ }.get(_auto_size, "axentx/surrogate-1-7B-v1.2-research")
211
  HUB_ID = os.environ.get("HUB_MODEL_ID", _default_hub)
212
  # seq_len auto-shrinks for smaller hardware budget
213
  _default_seq = {32.0: 2048, 14.0: 4096, 7.0: 8192}.get(_auto_size, 2048)
 
339
  except Exception as e:
340
  print(f" ✗ Magpie skip (repo not yet published): {type(e).__name__}: {str(e)[:80]}")
341
 
342
+ # ── V8 RESEARCH-DRIVEN DATASET BLEND ────────────────────────────────────────
343
+ # From research §devsecops-sre-agentic.md (top-5 datasets) + §coding-llm-frontier
344
+ # (#5 Code-Feedback). Each blend is opt-in via env knob (default ON).
345
+ # Format-tolerant extract_pair() handles ShareGPT, instruction/output, etc.
346
+ def merge_external(repo: str, take: int, weight: float, name: str):
347
+ """Stream-and-merge a HF dataset with weight oversampling."""
348
+ if take <= 0:
349
+ print(f" - {name}: disabled (take=0)")
350
+ return 0
351
+ try:
352
+ # Many of these datasets are gated; use HF_TOKEN automatically
353
+ ds = load_dataset(repo, split="train", streaming=True)
354
+ n = 0
355
+ replicate = max(1, int(round(weight)))
356
+ for ex in ds:
357
+ if n >= take: break
358
+ pair = extract_pair(ex)
359
+ if not pair: continue
360
+ p, r = pair
361
+ for _ in range(replicate):
362
+ rows.append({"prompt": p, "response": r})
363
+ n += 1
364
+ print(f" + {name}: {n:,} pairs × {replicate} = {n*replicate:,} rows merged")
365
+ return n
366
+ except Exception as e:
367
+ msg = f"{type(e).__name__}: {str(e)[:90]}"
368
+ print(f" ✗ {name} skip ({repo}): {msg}")
369
+ return 0
370
+
371
+ # Research-recommended weights — see knowledge/trends-2026/devsecops-sre-agentic.md
372
+ merge_external("Team-ACE/ToolACE", int(os.environ.get("TAKE_TOOLACE", "8000")), 1.5, "ToolACE")
373
+ merge_external("AmazonScience/Multi-IaC-Eval", int(os.environ.get("TAKE_MULTIIAC", "5000")), 2.0, "Multi-IaC-Eval")
374
+ merge_external("Salesforce/xlam-function-calling-60k", int(os.environ.get("TAKE_XLAM", "10000")), 1.0, "xLAM-fn-call-60k")
375
+ merge_external("ibm-research/ITBench-Trajectories", int(os.environ.get("TAKE_ITBENCH", "3000")), 2.0, "ITBench-Trajectories")
376
+ merge_external("m-a-p/Code-Feedback", int(os.environ.get("TAKE_CODEFB", "8000")), 1.0, "Code-Feedback")
377
+
378
+ print(f" total rows after V8 blend: {len(rows):,}")
379
+
380
  raw = Dataset.from_list(rows)
381
  # (Active-learning teachable filter applied AFTER model load β€” see below.
382
  # Filtering needs the 4-bit base model to score perplexity, which doesn't
 
483
  use_dora=True, # R2: DoRA
484
  task_type="CAUSAL_LM",
485
  )
486
+ # V8: PiSSA init by default (research §coding-llm-frontier #4) — SVD of base
487
+ # weights gives a much better starting point than gaussian. LoftQ/gaussian
488
+ # remain as env-controlled fallback for A/B comparison.
489
+ LORA_INIT = os.environ.get("SUR_LORA_INIT", "pissa_niter_4")
490
  try:
491
  from peft import LoraConfig as _Probe
492
  import inspect
493
  _sig = inspect.signature(_Probe).parameters
494
  if "use_rslora" in _sig: lora_kwargs["use_rslora"] = True
495
  if "init_lora_weights" in _sig:
496
+ if LORA_INIT.startswith("pissa"):
497
+ lora_kwargs["init_lora_weights"] = LORA_INIT # "pissa" or "pissa_niter_K"
498
+ elif LORA_INIT == "loftq":
499
+ try:
500
+ from peft import LoftQConfig
501
+ lora_kwargs["init_lora_weights"] = "loftq"
502
+ lora_kwargs["loftq_config"] = LoftQConfig(loftq_bits=4, loftq_iter=5)
503
+ except Exception as e:
504
+ print(f" ⚠ LoftQ unavailable, falling back to gaussian: {e}")
505
+ # else: gaussian default
506
+ except Exception as e:
507
+ print(f" ⚠ LoRA config probe failed: {e}")
508
  print(f" LoRA config: r={LORA_R}, DoRA={lora_kwargs.get('use_dora')}, "
509
  f"RSLoRA={lora_kwargs.get('use_rslora', False)}, "
510
  f"init={lora_kwargs.get('init_lora_weights', 'gaussian')}, "
 
514
  model = get_peft_model(model, lora)
515
  model.print_trainable_parameters()
516
 
517
+ # ── V8: LoRA+ optimizer (research §coding-llm-frontier #3) ──────────────────
518
+ # Hayou et al 2024 (arxiv 2402.12354): the B matrix in LoRA needs a learning
519
+ # rate ~16× higher than A for fastest convergence and a +1-2pp benchmark lift.
520
+ # Free improvement — no extra memory cost. Activated via SUR_LORA_PLUS_RATIO.
521
+ LORA_PLUS_RATIO = float(os.environ.get("SUR_LORA_PLUS_RATIO", "16"))
522
+ LORA_PLUS_OPT = None # set later if available
523
+ if LORA_PLUS_RATIO > 1.0:
524
+ try:
525
+ # peft.optimizers.create_loraplus_optimizer is the canonical helper
526
+ # (peft>=0.13). For older peft we fall back to manual param-group split.
527
+ from peft.optimizers import create_loraplus_optimizer # type: ignore
528
+ import bitsandbytes as bnb_lib
529
+ LORA_PLUS_OPT = create_loraplus_optimizer(
530
+ model=model,
531
+ optimizer_cls=bnb_lib.optim.PagedAdamW8bit,
532
+ lr=float(os.environ.get("LEARNING_RATE", "7e-5")),
533
+ loraplus_lr_ratio=LORA_PLUS_RATIO,
534
+ weight_decay=0.01,
535
+ )
536
+ print(f" LoRA+ optimizer: lr_B/lr_A = {LORA_PLUS_RATIO}x (paged AdamW 8-bit)")
537
+ except Exception as e:
538
+ print(f" ⚠ LoRA+ helper unavailable ({type(e).__name__}: {e}) — manual split")
539
+ try:
540
+ import bitsandbytes as bnb_lib
541
+ param_groups = [
542
+ {"params": [p for n, p in model.named_parameters()
543
+ if "lora_A" in n], "lr": float(os.environ.get("LEARNING_RATE", "7e-5"))},
544
+ {"params": [p for n, p in model.named_parameters()
545
+ if "lora_B" in n], "lr": float(os.environ.get("LEARNING_RATE", "7e-5")) * LORA_PLUS_RATIO},
546
+ ]
547
+ LORA_PLUS_OPT = bnb_lib.optim.PagedAdamW8bit(param_groups, weight_decay=0.01)
548
+ print(f" LoRA+ manual split: lr_B/lr_A = {LORA_PLUS_RATIO}x")
549
+ except Exception as e2:
550
+ print(f" ⚠ LoRA+ manual split also failed ({e2}) — using SFTTrainer default optim")
551
+ LORA_PLUS_OPT = None
552
+ else:
553
+ print(" LoRA+ disabled (SUR_LORA_PLUS_RATIO ≤ 1.0)")
554
+
555
  # ── Format chat template (system + user + assistant) ────────────────────────
556
  def fmt(ex):
557
  msgs = [
 
605
  report_to="none",
606
  )
607
 
608
+ trainer_kwargs = dict(
609
  model=model,
610
  args=sft_cfg,
611
  train_dataset=raw,
612
  tokenizer=tok,
613
  )
614
+ if LORA_PLUS_OPT is not None:
615
+ # Pass tuple (optimizer, lr_scheduler=None) so HF Trainer doesn't rebuild
616
+ trainer_kwargs["optimizers"] = (LORA_PLUS_OPT, None)
617
+
618
+ trainer = SFTTrainer(**trainer_kwargs)
619
 
620
  print()
621
  print("━━━ training start ━━━")
 
624
 
625
  # Final push (in case last save_steps didn't trigger)
626
  trainer.push_to_hub(commit_message=(
627
+ f"Surrogate-1 v1.2-research SFT — base={BASE.split('/')[-1]}, "
628
+ f"r={LORA_R}+DoRA+RSLoRA+{lora_kwargs.get('init_lora_weights','gauss')}, "
629
+ f"LoRA+x{LORA_PLUS_RATIO} NEFTune α=5 seq={SEQ_LEN}, "
630
  f"{len(rows):,} samples Γ— {EPOCHS} epochs (Kaggle T4Γ—2)"))
631
  print("βœ… pushed to", HUB_ID)
632
+
633
+ # ── V8 GRPO Phase-2 hook (scaffold only — disabled by default) ─────────────
634
+ # Research §coding-llm-frontier pick #1: post-SFT GRPO with execution-based
635
+ # rewards is the BIGGEST single lift (+5-9pp LCB v6, +4-7pp HumanEval+).
636
+ # Implementing the RL loop here would require a Python sandbox + unit-test
637
+ # generator + group-of-N rollouts, all of which strain T4×2. Scaffolded but
638
+ # gated behind RUN_GRPO=1 + TRL>=0.12 + ≥30GB peak VRAM headroom.
639
+ if os.environ.get("RUN_GRPO", "0") == "1":
640
+ try:
641
+ from trl import GRPOTrainer, GRPOConfig # type: ignore
642
+ print("━━━ Phase 2: GRPO with execution rewards (experimental) ━━━")
643
+ # Reward fn: run candidate code in subprocess, +1 if all unit tests
644
+ # pass, 0 otherwise. Group-of-4 rollouts per prompt.
645
+ import re, subprocess, tempfile, signal
646
+ def reward_unit_test_pass(prompts, completions, **kw):
647
+ rewards = []
648
+ for c in completions:
649
+ # Extract first ```python ... ``` block
650
+ m = re.search(r"```python\s*\n(.*?)\n```", c, re.S)
651
+ code = m.group(1) if m else c
652
+ with tempfile.NamedTemporaryFile("w", suffix=".py",
653
+ delete=False) as f:
654
+ f.write(code); pth = f.name
655
+ try:
656
+ rc = subprocess.run(
657
+ ["python", "-c", f"exec(open('{pth}').read())"],
658
+ timeout=8, capture_output=True
659
+ ).returncode
660
+ rewards.append(1.0 if rc == 0 else 0.0)
661
+ except Exception:
662
+ rewards.append(0.0)
663
+ return rewards
664
+ grpo_cfg = GRPOConfig(
665
+ output_dir="./surrogate-1-v1.2-research-grpo",
666
+ num_generations=4, learning_rate=5e-7,
667
+ num_train_epochs=1, per_device_train_batch_size=1,
668
+ gradient_accumulation_steps=8,
669
+ bf16=BF16_OK, fp16=not BF16_OK,
670
+ push_to_hub=True, hub_model_id=HUB_ID + "-grpo",
671
+ hub_token=os.environ.get("HF_TOKEN"),
672
+ )
673
+ grpo = GRPOTrainer(
674
+ model=model, args=grpo_cfg,
675
+ reward_funcs=[reward_unit_test_pass],
676
+ train_dataset=raw,
677
+ )
678
+ grpo.train()
679
+ grpo.push_to_hub(commit_message=f"Surrogate-1 v1.2-research GRPO Phase-2")
680
+ print("✅ GRPO Phase-2 done")
681
+ except ImportError as e:
682
+ print(f" GRPO scaffold skipped — TRL too old: {e}")
683
+ except Exception as e:
684
+ print(f" ⚠ GRPO Phase-2 failed: {type(e).__name__}: {e}")
685
+ print(" (SFT checkpoint is still saved — GRPO is post-SFT booster)")
686
  PYEOF
687
 
688
  # ── Push notebook to Kaggle (creates if not exists, updates if exists) ─────
bin/v2/autonomous-release.sh ADDED
@@ -0,0 +1,425 @@
1
+ #!/usr/bin/env bash
2
+ # Surrogate-1 — autonomous release daemon.
3
+ #
4
+ # Loop:
5
+ # 1. Recon — pull HN top, GitHub trending, ProductHunt feed, RSS list.
6
+ # 2. Cluster signals into "feature gaps" (frequency Γ— recency Γ— relevance).
7
+ # 3. For each above-threshold gap, ask Surrogate-1 to write a spec.md
8
+ # (problem / user-stories / acceptance / impact / out-of-scope).
9
+ # 4. Ask Surrogate-1 for an implementation patch + tests.
10
+ # 5. CISC self-consistency: generate 3 patch candidates, take the one
11
+ # that passes verifier-ensemble + has highest test pass rate.
12
+ # 6. If verdict ok → open a draft PR in target repo, run CI in canary
13
+ # branch with metric-gated promotion (Flagger-style if available).
14
+ # 7. Auto-rollback on SLO violation; auto-promote if green for COOLDOWN.
15
+ # 8. Outcome → outcomes.jsonl for self-improve.
16
+ #
17
+ # Owner-controlled scope (only repos this daemon may touch):
18
+ # AUTO_RELEASE_REPOS env (space-separated, matching the array expansion below), default = axentx/surrogate-1
19
+ #
20
+ # Hard guards:
21
+ # - Never push to main; always open PR (draft) on an auto/* branch
22
+ # - Diff ≤ 600 lines, ≤ 12 files, must include tests
23
+ # - All HardGuards from verifier-ensemble.py apply
24
+ # - PR labeled "autonomous-release" + linked to outcome record id
25
+ #
26
+ # Usage:
27
+ # nohup bash bin/v2/autonomous-release.sh \
28
+ # > $HOME/.surrogate/logs/autonomous-release.log 2>&1 &
29
+ #
30
+ # Cron once per hour:
31
+ # 0 * * * * bash $HOME/.surrogate/hf-space/bin/v2/autonomous-release.sh --once
32
+ set -uo pipefail
33
+ [[ -f "$HOME/.hermes/.env" ]] && { set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a; }
34
+
35
+ HFB="$HOME/.surrogate/hf-space/bin/v2"
36
+ STATE="$HOME/.surrogate/state"
37
+ SPECS="$STATE/specs"
38
+ LOG="$HOME/.surrogate/logs/autonomous-release.log"
39
+ mkdir -p "$STATE" "$SPECS" "$(dirname "$LOG")"
40
+
41
+ ONCE=0
42
+ [[ "${1:-}" == "--once" ]] && ONCE=1
43
+
44
+ INTERVAL_SEC="${REL_INTERVAL_SEC:-3600}" # 1 h
45
+ SPACE="${REL_SPACE:-surrogate1/surrogate-1-zero-gpu}"
46
+ REPOS=(${AUTO_RELEASE_REPOS:-axentx/surrogate-1})
47
+ RECON_LIMIT="${REL_RECON_LIMIT:-50}"
48
+ CISC_N="${REL_CISC_N:-3}"
49
+ GAP_FREQ_THRESHOLD="${REL_GAP_FREQ:-3}" # signal must appear in ≥3 sources
50
+
51
+ log() { echo "[$(date '+%Y-%m-%dT%H:%M:%S')] $*" | tee -a "$LOG"; }
52
+ notify() {
53
+ [[ -z "${DISCORD_WEBHOOK:-}" ]] && return
54
+ curl -s -X POST -H "Content-Type: application/json" \
55
+ -d "{\"content\":\"🚀 autonomous-release: $1\"}" \
56
+ "$DISCORD_WEBHOOK" >/dev/null 2>&1 || true
57
+ }
58
+
59
+ # ── Recon: pull signals from public sources ─────────────────────────────────
60
+ recon() {
61
+ local out="$1"
62
+ log " recon → $out"
63
+ : > "$out"
64
+
65
+ # HN top stories — Algolia public API, no auth needed
66
+ curl -fsS --max-time 15 \
67
+ "https://hn.algolia.com/api/v1/search?tags=story&numericFilters=points>50&hitsPerPage=$RECON_LIMIT" \
68
+ 2>/dev/null | python3 -c "
69
+ import json, sys
70
+ try: d = json.load(sys.stdin)
71
+ except: sys.exit(0)
72
+ for h in d.get('hits', []):
73
+ print(json.dumps({'src':'hn','title':h.get('title',''),'url':h.get('url',''),
74
+ 'score':h.get('points',0),'ts':h.get('created_at','')}))
75
+ " 2>/dev/null >> "$out"
76
+
77
+ # GitHub trending — no official API, scrape via /trending
78
+ curl -fsS --max-time 20 \
79
+ "https://github.com/trending?since=daily&spoken_language_code=en" 2>/dev/null \
80
+ | python3 -c "
81
+ import sys, re, json
82
+ html = sys.stdin.read()
83
+ # very light extractor — avoid pulling beautifulsoup just for this
84
+ for m in re.finditer(r'<h2 class=\"h3 lh-condensed\">\s*<a href=\"([^\"]+)\"', html):
85
+ repo = m.group(1).lstrip('/')
86
+ print(json.dumps({'src':'gh-trending','title':repo,'url':'https://github.com/'+repo,
87
+ 'score':1,'ts':''}))
88
+ " 2>/dev/null | head -n 30 >> "$out"
89
+
90
+ # ProductHunt — public RSS-ish endpoint
91
+ curl -fsS --max-time 15 \
92
+ "https://www.producthunt.com/feed" 2>/dev/null \
93
+ | python3 -c "
94
+ import sys, re, json
95
+ xml = sys.stdin.read()
96
+ for m in list(re.finditer(r'<title>([^<]+)</title>\s*<link>([^<]+)</link>', xml))[:30]:
97
+ print(json.dumps({'src':'producthunt','title':m.group(1),'url':m.group(2),'score':1,'ts':''}))
98
+ " 2>/dev/null >> "$out" || true
99
+
100
+ local n; n=$(wc -l < "$out" | tr -d ' ')
101
+ log " collected $n signals"
102
+ }
103
+
104
+ # ── Gap analysis: cluster signals by keyword overlap ────────────────────────
105
+ gap_analysis() {
106
+ local recon_in="$1" gaps_out="$2"
107
+ python3 - <<PYEOF
108
+ import json, re, collections
109
+ from pathlib import Path
110
+
111
+ # Owner-relevant keywords — bias the funnel toward what Surrogate-1 cares about
112
+ OWNER_KW = {
113
+ "agent","agentic","autonomous","llm","fine-tune","lora","peft",
114
+ "dpo","grpo","rlhf","rlaif","sft","quantization","bitsandbytes",
115
+ "vllm","sglang","tgi","inference","kubernetes","k8s","helm",
116
+ "terraform","cloudformation","aws","prowler","cspm","sre",
117
+ "incident","oncall","postmortem","observability","prometheus",
118
+ "opentelemetry","loki","grafana","argo","gitops","cicd",
119
+ "security","cve","cwe","sbom","slsa","supply-chain","gitleaks",
120
+ "semgrep","sast","dast","mcp","computer-use","tool-use","agent-bench"
121
+ }
122
+ sigs = []
123
+ for L in open("$recon_in"):
124
+ try: sigs.append(json.loads(L))
125
+ except: pass
126
+
127
+ # tokenize titles, score by owner-kw overlap
128
+ def toks(s):
129
+ return set(t.lower() for t in re.findall(r"[a-zA-Z][a-zA-Z0-9-]+", s or ""))
130
+
131
+ clusters = collections.defaultdict(list)
132
+ for s in sigs:
133
+ t = toks(s.get("title", ""))
134
+ overlap = t & OWNER_KW
135
+ if not overlap:
136
+ continue
137
+ # bucket by sorted overlap as cluster key
138
+ key = "+".join(sorted(overlap)[:3])
139
+ clusters[key].append(s)
140
+
141
+ gaps = []
142
+ for key, items in clusters.items():
143
+ n_sources = len({i["src"] for i in items})
144
+ if n_sources >= $GAP_FREQ_THRESHOLD or len(items) >= 5:
145
+ gaps.append({
146
+ "topic": key,
147
+ "n_signals": len(items),
148
+ "n_sources": n_sources,
149
+ "examples": [{"title": i["title"], "url": i["url"]} for i in items[:5]],
150
+ })
151
+
152
+ gaps.sort(key=lambda g: (g["n_sources"], g["n_signals"]), reverse=True)
153
+ gaps = gaps[:5] # cap at top 5 per cycle
154
+
155
+ with open("$gaps_out", "w") as f:
156
+ json.dump(gaps, f, indent=2)
157
+
158
+ print(f" → {len(gaps)} gaps identified")
159
+ PYEOF
160
+ }
161
+
162
+ # ── Build spec.md from a gap ────────────────────────────────────────────────
163
+ build_spec() {
164
+ local gap_json="$1" spec_out="$2"
165
+ local work; work=$(mktemp -d)
166
+ cat > "$work/prompt.md" <<EOF
167
+ You are Surrogate-1 in autonomous-release mode. A market signal cluster has
168
+ crossed threshold. Synthesize a Spec-Driven-Development spec for ONE
169
+ feature Surrogate-1 itself should ship — must be a small self-improvement
170
+ to the Surrogate-1 platform (training scripts, daemons, evals, dataset
171
+ quality tooling, etc.). Out of scope: external customer features, anything
172
+ needing payment/PII/user data.
173
+
174
+ Signal cluster:
175
+ \`\`\`json
176
+ $(cat "$gap_json")
177
+ \`\`\`
178
+
179
+ Owner constraints:
180
+ - Diff target ≤600 lines / ≤12 files
181
+ - Must include tests
182
+ - Must benefit at least one of: HumanEval+/MBPP+/LCB v6/SWE-Bench/BFCL/axentx-eval-50
183
+ OR the autonomous-{sre,release,improve} daemons.
184
+ - Must be reversible (rollback step required)
185
+
186
+ Output ONLY this JSON schema:
187
+ {
188
+ "title": "<3-7 word feature name>",
189
+ "problem": "<paragraph: what's missing today>",
190
+ "user_stories": ["As Surrogate-1, I want X so that Y", ...],
191
+ "acceptance_criteria": ["Bench score Z improves by ≥N%", ...],
192
+ "impact": "<expected metric uplift, citable>",
193
+ "competitors_observed": "<who is doing this elsewhere — from signal cluster>",
194
+ "out_of_scope": ["...","..."],
195
+ "rollout_plan": "<canary → promote, with SLO gate>",
196
+ "confidence": 0.0-1.0
197
+ }
198
+ EOF
199
+ python3 "$HFB/surrogate-call.py" --space "$SPACE" \
200
+ --prompt-file "$work/prompt.md" --schema spec \
201
+ --max-tokens 1500 --temperature 0.3 --out "$spec_out"
202
+ local rc=$?
203
+ rm -rf "$work"
204
+ return $rc
205
+ }
206
+
207
+ # ── Build patch candidates with CISC self-consistency ───────────────────────
208
+ build_patch_cisc() {
209
+ local spec_path="$1" out_dir="$2"
210
+ mkdir -p "$out_dir"
211
+ local prompt; prompt=$(mktemp)
212
+ cat > "$prompt" <<EOF
213
+ You are Surrogate-1. Implement the following spec. Produce a unified diff
214
+ + test file. Diff must apply cleanly via \`patch -p1\`.
215
+
216
+ Spec:
217
+ \`\`\`json
218
+ $(cat "$spec_path")
219
+ \`\`\`
220
+
221
+ Hard rules:
222
+ - Modify only files under \$HOME/.surrogate/hf-space/ or under axentx
223
+ repos cloned into \$HOME/develope/.
224
+ - Include or extend tests under tests/v2/ matching the changed file.
225
+ - No new top-level dependency without justification in the diff.
226
+ - Diff under 600 lines / 12 files.
227
+
228
+ Output ONLY this JSON schema:
229
+ {
230
+ "target_file": "<primary file path>",
231
+ "kind": "code"|"iac"|"shell",
232
+ "patch": "<unified diff text>",
233
+ "test_plan": "<commands to verify post-apply>",
234
+ "rollback": "<git revert <sha> or patch -R>",
235
+ "confidence": 0.0-1.0
236
+ }
237
+ EOF
238
+ for i in $(seq 1 $CISC_N); do
239
+ log " CISC candidate $i/$CISC_N"
240
+ # vary temperature for diversity
241
+ local T; T=$(python3 -c "print(round(0.2 + 0.15*$i, 2))")
242
+ python3 "$HFB/surrogate-call.py" --space "$SPACE" \
243
+ --prompt-file "$prompt" --schema patch \
244
+ --max-tokens 2000 --temperature "$T" \
245
+ --out "$out_dir/cand-$i.json" 2>>"$LOG" || \
246
+ log " cand-$i failed (continuing)"
247
+ done
248
+ rm -f "$prompt"
249
+ ls "$out_dir"/cand-*.json 2>/dev/null | wc -l | tr -d ' '
250
+ }
251
+
252
+ # ── Vote: pick best candidate by verifier verdict + confidence ──────────────
253
+ pick_winner() {
254
+ local cand_dir="$1" winner_out="$2"
255
+ local best="" best_score=-1
256
+ for c in "$cand_dir"/cand-*.json; do
257
+ [[ -f "$c" ]] || continue
258
+ local target patch kind conf
259
+ target=$(python3 -c "import json; print(json.load(open('$c')).get('target_file',''))")
260
+ kind=$(python3 -c "import json; print(json.load(open('$c')).get('kind','code'))")
261
+ conf=$(python3 -c "import json; print(json.load(open('$c')).get('confidence',0))")
262
+ python3 -c "import json,sys; sys.stdout.write(json.load(open('$c')).get('patch',''))" > "$cand_dir/$(basename "$c" .json).patch"
263
+
264
+ local verdict_path="$cand_dir/$(basename "$c" .json).verdict.json"
265
+ python3 "$HFB/verifier-ensemble.py" \
266
+ --change "$cand_dir/$(basename "$c" .json).patch" \
267
+ --target "$target" --kind "$kind" --confidence "$conf" \
268
+ --out "$verdict_path" >/dev/null 2>&1 || true
269
+
270
+ local ok npass
271
+ ok=$(python3 -c "import json; print(json.load(open('$verdict_path')).get('ok',False))" 2>/dev/null || echo False)
272
+ npass=$(python3 -c "import json; print(json.load(open('$verdict_path')).get('n_pass',0))" 2>/dev/null || echo 0)
273
+ local score
274
+ score=$(python3 -c "print(int($npass) + (10 if '$ok'=='True' else 0) + float($conf))")
275
+ log " cand=$(basename "$c") ok=$ok pass=$npass conf=$conf → score=$score"
276
+ if (( $(python3 -c "print(1 if $score > $best_score else 0)") )); then
277
+ best="$c"; best_score=$score
278
+ fi
279
+ done
280
+ if [[ -n "$best" ]]; then
281
+ cp "$best" "$winner_out"
282
+ cp "$cand_dir/$(basename "$best" .json).verdict.json" "${winner_out%.json}.verdict.json"
283
+ log " winner=$(basename "$best") score=$best_score"
284
+ return 0
285
+ fi
286
+ return 1
287
+ }
288
+
289
+ # ── Sweep ───────────────────────────────────────────────────────────────────
290
+ sweep() {
291
+ local ts; ts=$(date -u +%Y%m%dT%H%M%SZ)
292
+ local cycle="$STATE/release-$ts"
293
+ mkdir -p "$cycle"
294
+ log "═══ release sweep $ts ═══"
295
+
296
+ recon "$cycle/recon.jsonl"
297
+ gap_analysis "$cycle/recon.jsonl" "$cycle/gaps.json"
298
+
299
+ local n_gaps
300
+ n_gaps=$(python3 -c "import json; print(len(json.load(open('$cycle/gaps.json'))))")
301
+ if (( n_gaps == 0 )); then
302
+ log " no gaps above threshold — skipping cycle"
303
+ return 0
304
+ fi
305
+
306
+ # Process top gap only this cycle (avoid PR flood)
307
+ python3 -c "
308
+ import json
309
+ g = json.load(open('$cycle/gaps.json'))[0]
310
+ json.dump(g, open('$cycle/top-gap.json', 'w'))
311
+ print(g['topic'])
312
+ " | while read -r topic; do
313
+ log " top gap: $topic"
314
+ local spec_path="$cycle/spec.json"
315
+ if ! build_spec "$cycle/top-gap.json" "$spec_path"; then
316
+ log " spec build failed — skipping"
317
+ continue
318
+ fi
319
+ local title
320
+ title=$(python3 -c "import json; print(json.load(open('$spec_path')).get('title','untitled'))")
321
+ log " spec: $title"
322
+
323
+ local cand_dir="$cycle/candidates"
324
+ local n_cand
325
+ n_cand=$(build_patch_cisc "$spec_path" "$cand_dir" | tail -n 1)  # last stdout line is the count; earlier lines are tee'd log output
326
+ log " built $n_cand patch candidates (target $CISC_N)"
327
+ if (( n_cand == 0 )); then
328
+ python3 "$HFB/outcome-log.py" --daemon release --trigger "gap:$topic" \
329
+ --anomaly "$cycle/top-gap.json" --response "$spec_path" \
330
+ --applied false --outcome error \
331
+ --lesson "no patch candidates produced" || true
332
+ continue
333
+ fi
334
+
335
+ if pick_winner "$cand_dir" "$cycle/winner.json"; then
336
+ local ok
337
+ ok=$(python3 -c "import json; print(json.load(open('$cycle/winner.verdict.json')).get('ok',False))")
338
+ if [[ "$ok" == "True" ]]; then
339
+ log " → opening draft PR"
340
+ open_draft_pr "$cycle"
341
+ else
342
+ log " winner failed verifier — queueing"
343
+ python3 "$HFB/outcome-log.py" --daemon release --trigger "gap:$topic" \
344
+ --anomaly "$cycle/top-gap.json" --response "$spec_path" \
345
+ --verdict "$cycle/winner.verdict.json" \
346
+ --applied false --outcome queued \
347
+ --lesson "best candidate still failed verifier" || true
348
+ fi
349
+ else
350
+ log " no winner — all candidates failed"
351
+ fi
352
+ done
353
+
354
+ log "═══ sweep done ═══"
355
+ }
356
+
357
+ # ── Open draft PR (gh CLI required) ─────────────────────────────────────────
358
+ open_draft_pr() {
359
+ local cycle="$1"
360
+ if ! command -v gh >/dev/null 2>&1; then
361
+ log " gh CLI missing — queueing instead of PR"
362
+ python3 "$HFB/outcome-log.py" --daemon release \
363
+ --trigger "release_cycle" \
364
+ --response "$cycle/winner.json" \
365
+ --applied false --outcome queued \
366
+ --lesson "gh CLI not installed" || true
367
+ return 1
368
+ fi
369
+
370
+ local target_repo="${REPOS[0]}"
371
+ local target_file patch_file branch
372
+ target_file=$(python3 -c "import json; print(json.load(open('$cycle/winner.json'))['target_file'])")
373
+ patch_file=$(ls "$cycle/candidates"/*.patch 2>/dev/null | head -1)  # patches live under candidates/
374
+ branch="auto/release-$(date -u +%Y%m%d-%H%M)"
375
+
376
+ # Clone if not present
377
+ local clone_dir="$STATE/repos/$(basename "$target_repo")"
378
+ if [[ ! -d "$clone_dir/.git" ]]; then
379
+ gh repo clone "$target_repo" "$clone_dir" 2>>"$LOG" || {
380
+ log " clone failed for $target_repo"
381
+ return 1
382
+ }
383
+ fi
384
+
385
+ ( cd "$clone_dir"
386
+ git fetch origin main 2>>"$LOG"
387
+ git checkout -B "$branch" origin/main 2>>"$LOG"
388
+ patch -p1 < "$patch_file" 2>>"$LOG" || { log " patch apply failed"; exit 1; }
389
+ git add -A
390
+ git commit -m "auto-release: $(python3 -c "import json; print(json.load(open('$cycle/winner.json')).get('target_file',''))")
391
+ auto-generated by autonomous-release.sh
392
+ spec=$cycle/spec.json
393
+ verdict=$cycle/winner.verdict.json"
394
+ git push -u origin "$branch" 2>>"$LOG"
395
+ gh pr create --draft --title "[auto-release] $(python3 -c "import json; print(json.load(open('$cycle/spec.json')).get('title',''))")" \
396
+ --body "Autonomous release.
397
+
398
+ **Spec**: see \`$cycle/spec.json\`
399
+ **Verdict**: see \`$cycle/winner.verdict.json\`
400
+
401
+ This PR was generated by the Surrogate-1 autonomous-release daemon. It is a DRAFT — promote to ready-for-review only after CI passes and a human eyeballs the diff." \
402
+ --label "autonomous-release" 2>&1 | tee -a "$LOG"
403
+ ) || true
404
+
405
+ python3 "$HFB/outcome-log.py" --daemon release \
406
+ --trigger "release_cycle" \
407
+ --anomaly "$cycle/top-gap.json" \
408
+ --response "$cycle/winner.json" \
409
+ --verdict "$cycle/winner.verdict.json" \
410
+ --applied true --outcome success \
411
+ --lesson "draft PR opened on $branch" || true
412
+ notify "draft PR opened on $target_repo / $branch"
413
+ }
414
+
415
+ if (( ONCE )); then
416
+ sweep
417
+ exit 0
418
+ fi
419
+
420
+ log "═══ autonomous-release starting (interval=${INTERVAL_SEC}s) ═══"
421
+ notify "online — interval ${INTERVAL_SEC}s"
422
+ while true; do
423
+ sweep
424
+ sleep "$INTERVAL_SEC"
425
+ done
bin/v2/autonomous-sre.sh ADDED
@@ -0,0 +1,346 @@
1
+ #!/usr/bin/env bash
2
+ # Surrogate-1 — autonomous SRE daemon.
3
+ #
4
+ # 24×7 monitors infra Surrogate-1 itself owns or operates against, and
5
+ # tries to auto-heal incidents. Every candidate action passes through
6
+ # verifier-ensemble.py — anything that fails verification is QUEUED, never
7
+ # applied. The whole turn (anomaly → diagnosis → verdict → apply/queue →
8
+ # metric_after) is logged to outcomes.jsonl so self-improve.sh can build
9
+ # the next round's training data.
10
+ #
11
+ # Probe targets (all read-only by default):
12
+ # 1. HF Spaces health — runtime.stage / errorMessage
13
+ # 2. HF Datasets growth — last commit age (pipeline staleness)
14
+ # 3. ZeroGPU smoke test — small generation request
15
+ # 4. Kaggle kernel state — only if KAGGLE_KEY env is fresh
16
+ # 5. AWS via aws-cli — only if AWS_PROFILE set + Excise stack
17
+ # 6. GH Actions runs — `gh run list` for axentx orgs
18
+ #
19
+ # Auto-fix scope (whitelist of safe actions):
20
+ # - factory_reboot a stuck HF Space
21
+ # - re-trigger a failed GH workflow run
22
+ # - update a Space env var (already supported via swap-zerogpu-lora.sh)
23
+ # - apply a small (<300 line) diff to a file in $HOME/.surrogate/* if
24
+ # verifier-ensemble passes ALL checks
25
+ #
26
+ # Anything else → queued to ~/.surrogate/state/queue/<ts>.json for
27
+ # operator review. Refused-by-policy actions are LOGGED but never queued.
28
+ #
29
+ # Usage (long-lived daemon):
30
+ # nohup bash bin/v2/autonomous-sre.sh \
31
+ # > $HOME/.surrogate/logs/autonomous-sre.log 2>&1 &
32
+ #
33
+ # Or via cron every 5 min:
34
+ # */5 * * * * bash $HOME/.surrogate/hf-space/bin/v2/autonomous-sre.sh --once
35
+ set -uo pipefail
36
+ [[ -f "$HOME/.hermes/.env" ]] && { set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a; }
37
+
38
+ HFB="$HOME/.surrogate/hf-space/bin/v2"
39
+ STATE="$HOME/.surrogate/state"
40
+ QUEUE="$STATE/queue"
41
+ LOG="$HOME/.surrogate/logs/autonomous-sre.log"
42
+ mkdir -p "$STATE" "$QUEUE" "$(dirname "$LOG")"
43
+
44
+ ONCE=0
45
+ [[ "${1:-}" == "--once" ]] && ONCE=1
46
+ INTERVAL_SEC="${SRE_INTERVAL_SEC:-300}" # 5 min between full sweeps
47
+ SPACE_PRIMARY="${SRE_SPACE_PRIMARY:-surrogate1/surrogate-1-zero-gpu}"
48
+ SPACE_SECONDARY="${SRE_SPACE_SECONDARY:-ashirato/surrogate-1-zero-gpu}"
49
+ DATASETS=(${SRE_DATASETS:-axentx/surrogate-1-pairs axentx/surrogate-1-pairs-shard1 axentx/surrogate-1-pairs-shard2 axentx/surrogate-1-pairs-shard3 axentx/surrogate-1-pairs-shard4})
50
+ DATASET_STALE_HOURS="${SRE_DATASET_STALE_H:-3}"
51
+
52
+ log() { echo "[$(date '+%Y-%m-%dT%H:%M:%S')] $*" | tee -a "$LOG"; }
53
+ notify() {
54
+ [[ -z "${DISCORD_WEBHOOK:-}" ]] && return
55
+ curl -s -X POST -H "Content-Type: application/json" \
56
+ -d "{\"content\":\"🛡️ autonomous-sre: $1\"}" \
57
+ "$DISCORD_WEBHOOK" >/dev/null 2>&1 || true
58
+ }
59
+
60
+ # ── Single shared call to record an anomaly + decide ────────────────────────
61
+ handle_anomaly() {
62
+ local trigger="$1" anomaly_json="$2"
63
+ local ts; ts=$(date -u +%Y%m%dT%H%M%SZ)
64
+ local work; work=$(mktemp -d "$STATE/sre-$ts-XXXX")
65
+
66
+ echo "$anomaly_json" > "$work/anomaly.json"
67
+
68
+ # Build diagnosis prompt with last 3 outcomes for context
69
+ local recent
70
+ recent=$(tail -n 3 "$STATE/outcomes.jsonl" 2>/dev/null \
71
+ | python3 -c "import sys,json
72
+ for L in sys.stdin:
73
+ try: r = json.loads(L)
74
+ except: continue
75
+ print(f\"- {r['ts']} {r['daemon']}/{r['trigger']} → {r['outcome']}\")
76
+ " 2>/dev/null || echo " (none)")
77
+
78
+ cat > "$work/prompt.md" <<EOF
79
+ You are Surrogate-1 in SRE auto-heal mode. An anomaly has been detected by
80
+ the autonomous-sre daemon. Diagnose root cause and propose ONE specific
81
+ fix, OR explicitly say "fix_kind": "none" if you're <70% confident.
82
+
83
+ Trigger: $trigger
84
+
85
+ Anomaly details (JSON):
86
+ \`\`\`json
87
+ $anomaly_json
88
+ \`\`\`
89
+
90
+ Recent outcomes (last 3):
91
+ $recent
92
+
93
+ Hard constraints:
94
+ - Only propose fixes for systems Surrogate-1 owns: HF Spaces under
95
+ surrogate1/* + ashirato/* + axentx/*, HF datasets under axentx/*,
96
+ GH workflows in axentx repos. Refuse any AWS/prod/customer system.
97
+ # - Diff must be <300 lines, ≤8 files.
98
+ - No destructive operations (rm -rf, DROP, kubectl delete ns, IAM \\*:\\*).
99
+ # - If the fix is "factory_reboot Space X" — set fix_kind=shell with
100
+ patch=\`bash $HFB/swap-zerogpu-lora.sh AXENTX/<lora> ONLY=<name>\` style.
101
+
102
+ Respond ONLY with this JSON schema:
103
+ {
104
+ "diagnosis": "<one-paragraph root cause>",
105
+ "fix_kind": "code" | "iac" | "shell" | "sql" | "none",
106
+ "target_file": "<absolute path, or empty if shell-only>",
107
+ "patch": "<unified diff or shell command>",
108
+ "rollback": "<how to undo>",
109
+ "test_plan": "<how we'll know it worked>",
110
+ "confidence": 0.0-1.0
111
+ }
112
+ EOF
113
+
114
+ log " → calling Surrogate for diagnosis ($trigger)"
115
+ if ! python3 "$HFB/surrogate-call.py" \
116
+ --space "$SPACE_PRIMARY" \
117
+ --prompt-file "$work/prompt.md" \
118
+ --schema diagnosis \
119
+ --max-tokens 1200 --temperature 0.15 \
120
+ --out "$work/response.json" 2>"$work/call.err"; then
121
+ log " ✗ surrogate-call failed: $(cat "$work/call.err" | head -c 200)"
122
+ python3 "$HFB/outcome-log.py" --daemon sre --trigger "$trigger" \
123
+ --anomaly "$work/anomaly.json" --prompt "$work/prompt.md" \
124
+ --applied false --outcome error \
125
+ --lesson "endpoint unavailable for diagnosis" || true
126
+ return 1
127
+ fi
128
+
129
+ local fix_kind conf target patch
130
+ fix_kind=$(python3 -c "import json; print(json.load(open('$work/response.json')).get('fix_kind','none'))")
131
+ conf=$(python3 -c "import json; print(json.load(open('$work/response.json')).get('confidence',0))")
132
+ target=$(python3 -c "import json; print(json.load(open('$work/response.json')).get('target_file','') or '')")
133
+ patch=$(python3 -c "import json,sys; sys.stdout.write(json.load(open('$work/response.json')).get('patch',''))")
134
+
135
+ log " diagnosis: fix_kind=$fix_kind confidence=$conf target=$target"
136
+
137
+ if [[ "$fix_kind" == "none" ]] || [[ -z "$patch" ]]; then
138
+ log " Surrogate declined to act ($fix_kind / empty patch) — recording, no apply"
139
+ python3 "$HFB/outcome-log.py" --daemon sre --trigger "$trigger" \
140
+ --anomaly "$work/anomaly.json" --prompt "$work/prompt.md" \
141
+ --response "$work/response.json" --applied false --outcome rejected \
142
+ --lesson "model declined low-confidence" || true
143
+ return 0
144
+ fi
145
+
146
+ # Write patch to file for verifier
147
+ echo "$patch" > "$work/patch.txt"
148
+ [[ -z "$target" ]] && target="$work/patch.txt"
149
+
150
+ # Idempotency check — if same patch was applied <4 h ago, skip
151
+ if python3 "$HFB/idempotency.py" check --plan "$work/patch.txt" --ttl-hours 4 \
152
+ >"$work/idem.json" 2>/dev/null; then
153
+ log " idempotent skip — same patch applied recently"
154
+ python3 "$HFB/outcome-log.py" --daemon sre --trigger "$trigger" \
155
+ --anomaly "$work/anomaly.json" --response "$work/response.json" \
156
+ --applied false --outcome rejected \
157
+ --lesson "idempotent: $(python3 -c "import json; print(json.load(open('$work/idem.json'))['key'][:12])")" || true
158
+ return 0
159
+ fi
160
+
161
+ log " → verifier-ensemble"
162
+ local vrc=0
163
+ python3 "$HFB/verifier-ensemble.py" \
164
+ --change "$work/patch.txt" --target "$target" --kind "$fix_kind" \
165
+ --confidence "$conf" --out "$work/verdict.json" >/dev/null || vrc=$?
166
+
167
+ local verdict_ok
168
+ verdict_ok=$(python3 -c "import json; print(json.load(open('$work/verdict.json')).get('ok',False))")
169
+
170
+ if [[ "$verdict_ok" != "True" ]]; then
171
+ log " verdict: REJECTED → queueing for review"
172
+ cp -r "$work" "$QUEUE/$(basename "$work")"
173
+ python3 "$HFB/outcome-log.py" --daemon sre --trigger "$trigger" \
174
+ --anomaly "$work/anomaly.json" --prompt "$work/prompt.md" \
175
+ --response "$work/response.json" --verdict "$work/verdict.json" \
176
+ --applied false --outcome queued \
177
+ --lesson "verifier rejected β€” manual review" || true
178
+ notify "queued $trigger ($(python3 -c "import json; print(', '.join(json.load(open('$work/verdict.json')).get('reasons',[])[:2]))"))"
179
+ return 0
180
+ fi
181
+
182
+ log " verdict: SAFE → applying"
183
+ local apply_rc=0
184
+ if [[ "$fix_kind" == "shell" ]]; then
185
+ bash -c "$patch" 2>&1 | tee "$work/apply.log" || apply_rc=$?
186
+ elif [[ "$fix_kind" == "code" || "$fix_kind" == "iac" ]] && [[ -f "$target" ]]; then
187
+ # apply unified diff
188
+ ( cd "$(dirname "$target")" && patch -p1 --dry-run < "$work/patch.txt" \
189
+ && patch -p1 < "$work/patch.txt" ) 2>&1 | tee "$work/apply.log" \
190
+ || apply_rc=$?
191
+ else
192
+ apply_rc=99
193
+ echo "no apply path for fix_kind=$fix_kind target=$target" > "$work/apply.log"
194
+ fi
195
+
196
+ if [[ $apply_rc -eq 0 ]]; then
197
+ log " ✓ applied — capturing metric_after"
198
+ sleep 5
199
+ python3 "$HFB/idempotency.py" record --plan "$work/patch.txt" \
200
+ --daemon sre --outcome applied >/dev/null 2>&1 || true
201
+ python3 "$HFB/outcome-log.py" --daemon sre --trigger "$trigger" \
202
+ --anomaly "$work/anomaly.json" --prompt "$work/prompt.md" \
203
+ --response "$work/response.json" --verdict "$work/verdict.json" \
204
+ --applied true --outcome success \
205
+ --lesson "auto-heal worked first try" || true
206
+ notify "auto-healed $trigger (confidence=$conf)"
207
+ else
208
+ log " ✗ apply failed rc=$apply_rc — rolling back"
209
+ # best-effort rollback hint logged but not auto-applied
210
+ python3 "$HFB/outcome-log.py" --daemon sre --trigger "$trigger" \
211
+ --anomaly "$work/anomaly.json" --prompt "$work/prompt.md" \
212
+ --response "$work/response.json" --verdict "$work/verdict.json" \
213
+ --applied true --outcome rollback \
214
+ --lesson "apply rc=$apply_rc; rollback:$(python3 -c "import json; print(json.load(open('$work/response.json')).get('rollback','none')[:80])")" || true
215
+ notify "ROLLBACK $trigger rc=$apply_rc"
216
+ fi
217
+ }
218
+
219
+ # ��─ Probe 1: HF Space health ────────────────────────────────────────────────
220
+ probe_space() {
221
+ local space="$1"
222
+ local resp; resp=$(curl -fsS --max-time 15 \
223
+ ${HF_TOKEN:+-H "Authorization: Bearer $HF_TOKEN"} \
224
+ "https://huggingface.co/api/spaces/$space" 2>/dev/null) || return 0
225
+ local stage err
226
+ stage=$(echo "$resp" | python3 -c "import json,sys; print(json.load(sys.stdin).get('runtime',{}).get('stage','UNKNOWN'))" 2>/dev/null)
227
+ err=$(echo "$resp" | python3 -c "import json,sys; print(json.load(sys.stdin).get('runtime',{}).get('errorMessage','') or '')" 2>/dev/null)
228
+
229
+ case "$stage" in
230
+ RUNNING|BUILDING|CONFIG_ERROR_QUEUED|RUNNING_BUILDING) ;; # nominal/expected
231
+ STOPPED|RUNTIME_ERROR|BUILD_ERROR|NO_APP_FILE|*ERROR*)
232
+ log " ⚠ Space $space stage=$stage err=$err"
233
+ handle_anomaly "hf_space_${stage,,}" \
234
+ "$(printf '{"space":"%s","stage":"%s","error":%s}' \
235
+ "$space" "$stage" "$(python3 -c "import json; print(json.dumps('$err'))")")"
236
+ ;;
237
+ *) log " Space $space stage=$stage (no action)" ;;
238
+ esac
239
+ }
240
+
241
+ # ── Probe 2: dataset growth (staleness) ─────────────────────────────────────
242
+ probe_dataset_staleness() {
243
+ local ds="$1"
244
+ local resp; resp=$(curl -fsS --max-time 15 \
245
+ ${HF_TOKEN:+-H "Authorization: Bearer $HF_TOKEN"} \
246
+ "https://huggingface.co/api/datasets/$ds" 2>/dev/null) || return 0
247
+ local last_modified
248
+ last_modified=$(echo "$resp" | python3 -c "
249
+ import json, sys, datetime
250
+ try:
251
+ d = json.load(sys.stdin)
252
+ lm = d.get('lastModified') or d.get('createdAt')
253
+ print(lm or '')
254
+ except: print('')
255
+ " 2>/dev/null)
256
+ [[ -z "$last_modified" ]] && return 0
257
+ local age_h
258
+ age_h=$(python3 -c "
259
+ import datetime
260
+ lm = datetime.datetime.fromisoformat('${last_modified}'.replace('Z','+00:00'))
261
+ now = datetime.datetime.now(datetime.timezone.utc)
262
+ print(int((now - lm).total_seconds() / 3600))
263
+ " 2>/dev/null || echo 0)
264
+ if (( age_h > DATASET_STALE_HOURS )); then
265
+ log " ⚠ dataset $ds stale ${age_h}h (threshold ${DATASET_STALE_HOURS}h)"
266
+ handle_anomaly "hf_dataset_stale" \
267
+ "$(printf '{"dataset":"%s","age_hours":%d,"threshold":%d}' \
268
+ "$ds" "$age_h" "$DATASET_STALE_HOURS")"
269
+ fi
270
+ }
271
+
272
+ # ── Probe 3: ZeroGPU smoke (cheapest health signal) ─────────────────────────
273
+ probe_zerogpu_smoke() {
274
+ local space="$1"
275
+ local url="https://${space//\//-}.hf.space/api/predict"
276
+ if ! curl -fsS --max-time 30 -X POST -H "Content-Type: application/json" \
277
+ -d '{"data":["ping","hi",16,0.1]}' "$url" >/dev/null 2>&1; then
278
+ log " ⚠ ZeroGPU smoke FAILED on $space"
279
+ handle_anomaly "zerogpu_smoke_fail" \
280
+ "$(printf '{"space":"%s","url":"%s"}' "$space" "$url")"
281
+ fi
282
+ }
283
+
284
+ # ── Probe 4: GH Actions failures (best-effort) ──────────────────────────────
285
+ probe_gh_actions() {
286
+ if ! command -v gh >/dev/null 2>&1; then return 0; fi
287
+ for repo in axentx/arkashira axentx/midnightcrisis; do
288
+ local failed
289
+ failed=$(gh run list --repo "$repo" --limit 5 --json status,conclusion,name \
290
+ 2>/dev/null | python3 -c "
291
+ import json, sys
292
+ try: runs = json.load(sys.stdin)
293
+ except: runs = []
294
+ fails = [r for r in runs if r.get('conclusion') == 'failure']
295
+ print(len(fails))
296
+ " 2>/dev/null || echo 0)
297
+ if (( failed >= 2 )); then
298
+ log " ⚠ GH $repo: $failed of last 5 runs failed"
299
+ handle_anomaly "gh_workflow_repeated_failure" \
300
+ "$(printf '{"repo":"%s","failed_of_5":%d}' "$repo" "$failed")"
301
+ fi
302
+ done
303
+ }
304
+
305
+ # ── Probe 5: outcome log self-consistency (meta) ────────────────────────────
306
+ probe_outcome_log_health() {
307
+ if [[ ! -f "$STATE/outcomes.jsonl" ]]; then return 0; fi
308
+ local n_recent_fail
309
+ n_recent_fail=$(tail -n 20 "$STATE/outcomes.jsonl" 2>/dev/null | python3 -c "
310
+ import sys, json
311
+ n = 0
312
+ for L in sys.stdin:
313
+ try: r = json.loads(L)
314
+ except: continue
315
+ if r.get('outcome') in ('rollback','error'): n += 1
316
+ print(n)
317
+ " 2>/dev/null || echo 0)
318
+ if (( n_recent_fail >= 5 )); then
319
+ log " ⚠ ${n_recent_fail}/20 recent outcomes failed → degrading mode"
320
+ notify "degrading: $n_recent_fail/20 recent fails — operator review"
321
+ fi
322
+ }
323
+
324
+ # ── Sweep ───────────────────────────────────────────────────────────────────
325
+ sweep() {
326
+ log "═══ SRE sweep ═══"
327
+ probe_space "$SPACE_PRIMARY"
328
+ probe_space "$SPACE_SECONDARY"
329
+ for ds in "${DATASETS[@]}"; do probe_dataset_staleness "$ds"; done
330
+ probe_zerogpu_smoke "$SPACE_PRIMARY"
331
+ probe_gh_actions
332
+ probe_outcome_log_health
333
+ log "═══ sweep done ═══"
334
+ }
335
+
336
+ if (( ONCE )); then
337
+ sweep
338
+ exit 0
339
+ fi
340
+
341
+ log "═══ autonomous-sre starting (interval=${INTERVAL_SEC}s) ═══"
342
+ notify "online β€” interval ${INTERVAL_SEC}s"
343
+ while true; do
344
+ sweep
345
+ sleep "$INTERVAL_SEC"
346
+ done
bin/v2/bench-v1-vs-v15.sh CHANGED
@@ -37,6 +37,7 @@ MODELS=(
37
  "v1|axentx/surrogate-1-coder-7b-v1|Qwen/Qwen2.5-Coder-7B-Instruct"
38
  "base7B|Qwen/Qwen2.5-Coder-7B-Instruct|"
39
"v1.1-extended|axentx/surrogate-1-7B-v1.1-extended|Qwen/Qwen2.5-Coder-7B-Instruct"
40
  )
41
  # Bench ladder pivoted 2026-05-01 after V4 (32B OOM) + V5 (14B OOM) both
42
  # crashed Kaggle T4×2. Pick 7B as the validation base — fits T4×2 cleanly,
@@ -126,7 +127,7 @@ run_eval() {
126
  SWE_RESOLVED=$(grep -oE "resolved.*[0-9]+\.[0-9]+" "$out/swebench.log" 2>/dev/null | tail -1 | grep -oE "[0-9]+\.[0-9]+" | tail -1)
127
 
128
  # ── 7. axentx-eval-50 (custom in-domain DevSecOps eval) ──
129
- log " [7/7] axentx-eval-50 (custom DevSecOps)"
130
  if [[ -f "$HOME/.surrogate/hf-space/bin/v2/axentx-eval-50.py" ]]; then
131
  python3 "$HOME/.surrogate/hf-space/bin/v2/axentx-eval-50.py" \
132
  --model "$mdl" --out "$out/axentx-eval" 2>&1 | tee -a "$out/axentx-eval.log" | tail -30
@@ -135,6 +136,26 @@ run_eval() {
135
  AXENTX_SCORE="--"
136
  fi
137

138
  # Persist scores
139
  python3 - <<PYEOF
140
  import json
@@ -147,6 +168,8 @@ data["$label"] = {
147
  "ruler_16k_avg": "${RULER_AVG:-?}",
148
  "swebench_verified_lite100": "${SWE_RESOLVED:-?}",
149
  "axentx_eval_50": "${AXENTX_SCORE:-?}",
150
  }
151
  json.dump(data, open("$SUMMARY_JSON", "w"), indent=2)
152
  PYEOF
 
37
  "v1|axentx/surrogate-1-coder-7b-v1|Qwen/Qwen2.5-Coder-7B-Instruct"
38
  "base7B|Qwen/Qwen2.5-Coder-7B-Instruct|"
39
  "v1.1-extended|axentx/surrogate-1-7B-v1.1-extended|Qwen/Qwen2.5-Coder-7B-Instruct"
40
+ "v1.2-research|axentx/surrogate-1-7B-v1.2-research|Qwen/Qwen2.5-Coder-7B-Instruct"
41
  )
42
  # Bench ladder pivoted 2026-05-01 after V4 (32B OOM) + V5 (14B OOM) both
43
  # crashed Kaggle T4×2. Pick 7B as the validation base — fits T4×2 cleanly,
 
127
  SWE_RESOLVED=$(grep -oE "resolved.*[0-9]+\.[0-9]+" "$out/swebench.log" 2>/dev/null | tail -1 | grep -oE "[0-9]+\.[0-9]+" | tail -1)
128
 
129
  # ── 7. axentx-eval-50 (custom in-domain DevSecOps eval) ──
130
+ log " [7/9] axentx-eval-50 (custom DevSecOps)"
131
  if [[ -f "$HOME/.surrogate/hf-space/bin/v2/axentx-eval-50.py" ]]; then
132
  python3 "$HOME/.surrogate/hf-space/bin/v2/axentx-eval-50.py" \
133
  --model "$mdl" --out "$out/axentx-eval" 2>&1 | tee -a "$out/axentx-eval.log" | tail -30
 
136
  AXENTX_SCORE="--"
137
  fi
138
 
139
+ # ── 8. Multi-IaC-Eval (NEW V8) — CFN+TF+CDK pass-rate w/ cfn-guard +tfsec ─
140
+ log " [8/9] Multi-IaC-Eval (CFN/TF/CDK)"
141
+ if [[ -f "$HOME/.surrogate/hf-space/bin/v2/multi-iac-eval.py" ]]; then
142
+ python3 "$HOME/.surrogate/hf-space/bin/v2/multi-iac-eval.py" \
143
+ --model "$mdl" --out "$out/multi-iac" 2>&1 | tee -a "$out/multi-iac.log" | tail -30
144
+ MULTI_IAC=$(grep -oE "iac_pass_rate.*[0-9]+\.[0-9]+" "$out/multi-iac.log" | tail -1 | grep -oE "[0-9]+\.[0-9]+" | tail -1)
145
+ else
146
+ MULTI_IAC="--"
147
+ fi
148
+
149
+ # ── 9. ITBench-lite (NEW V8) — 102 K8s SRE/CISO/FinOps scenarios ──
150
+ log " [9/9] ITBench-lite"
151
+ if [[ -f "$HOME/.surrogate/hf-space/bin/v2/itbench-lite.py" ]]; then
152
+ python3 "$HOME/.surrogate/hf-space/bin/v2/itbench-lite.py" \
153
+ --model "$mdl" --out "$out/itbench" 2>&1 | tee -a "$out/itbench.log" | tail -30
154
+ ITBENCH=$(grep -oE "itbench_score.*[0-9]+\.[0-9]+" "$out/itbench.log" | tail -1 | grep -oE "[0-9]+\.[0-9]+" | tail -1)
155
+ else
156
+ ITBENCH="--"
157
+ fi
158
+
159
  # Persist scores
160
  python3 - <<PYEOF
161
  import json
 
168
  "ruler_16k_avg": "${RULER_AVG:-?}",
169
  "swebench_verified_lite100": "${SWE_RESOLVED:-?}",
170
  "axentx_eval_50": "${AXENTX_SCORE:-?}",
171
+ "multi_iac_eval": "${MULTI_IAC:-?}",
172
+ "itbench_lite": "${ITBENCH:-?}",
173
  }
174
  json.dump(data, open("$SUMMARY_JSON", "w"), indent=2)
175
  PYEOF
bin/v2/idempotency.py ADDED
@@ -0,0 +1,118 @@
1
+ #!/usr/bin/env python3
2
+ """Surrogate-1 — idempotency keys (research §autonomous-24x7 pattern 2).
3
+
4
+ Every autonomous action computes idempotency_key = sha256(plan). If the
5
+ same key has been seen within the TTL, the action is treated as already-
6
+ applied and SKIPPED (preventing replay storms when the same anomaly fires
7
+ twice in a row). Records live in a JSONL ledger.
8
+
9
+ Ledger entry:
10
+ {"key":"<sha256>", "ts":"...", "daemon":"sre|release", "outcome":"applied|queued"}
11
+
12
+ Usage:
13
+ # Check if seen recently — exit 0 if seen (skip), 1 if new
14
+ idempotency.py check --plan /path/to/plan.json --ttl-hours 4
15
+
16
+ # Record after applying
17
+ idempotency.py record --plan /path/to/plan.json \
18
+ --daemon sre --outcome applied
19
+ """
20
+ from __future__ import annotations
21
+
22
+ import argparse
23
+ import datetime as dt
24
+ import hashlib
25
+ import json
26
+ import os
27
+ import sys
28
+ from pathlib import Path
29
+
30
+ LEDGER = Path(os.environ.get(
31
+ "SURROGATE_IDEMPOTENCY_LEDGER",
32
+ str(Path.home() / ".surrogate/state/idempotency.jsonl")))
33
+
34
+
35
+ def compute_key(plan_path: Path) -> str:
36
+ txt = plan_path.read_text() if plan_path.is_file() else str(plan_path)
37
+ h = hashlib.sha256()
38
+ h.update(txt.encode())
39
+ return h.hexdigest()
40
+
41
+
42
+ def load_ledger() -> list[dict]:
43
+ if not LEDGER.exists():
44
+ return []
45
+ out = []
46
+ for L in LEDGER.read_text().splitlines():
47
+ try:
48
+ out.append(json.loads(L))
49
+ except Exception:
50
+ continue
51
+ return out
52
+
53
+
54
+ def append_ledger(rec: dict) -> None:
55
+ LEDGER.parent.mkdir(parents=True, exist_ok=True)
56
+ with LEDGER.open("a") as f:
57
+ f.write(json.dumps(rec) + "\n")
58
+
59
+
60
+ def is_recent(key: str, ttl_hours: float) -> bool:
61
+ cutoff = dt.datetime.now(dt.timezone.utc) - dt.timedelta(hours=ttl_hours)
62
+ for r in load_ledger():
63
+ if r.get("key") != key:
64
+ continue
65
+ try:
66
+ ts = dt.datetime.strptime(r["ts"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=dt.timezone.utc)  # make aware so comparing against the aware cutoff doesn't raise TypeError
67
+ except Exception:
68
+ continue
69
+ if ts > cutoff:
70
+ return True
71
+ return False
72
+
73
+
74
+ def main() -> int:
75
+ p = argparse.ArgumentParser()
76
+ sp = p.add_subparsers(dest="cmd", required=True)
77
+
78
+ pc = sp.add_parser("check")
79
+ pc.add_argument("--plan", required=True)
80
+ pc.add_argument("--ttl-hours", type=float, default=4.0)
81
+
82
+ pr = sp.add_parser("record")
83
+ pr.add_argument("--plan", required=True)
84
+ pr.add_argument("--daemon", required=True)
85
+ pr.add_argument("--outcome", required=True)
86
+
87
+ pk = sp.add_parser("key")
88
+ pk.add_argument("--plan", required=True)
89
+
90
+ args = p.parse_args()
91
+
92
+ if args.cmd == "key":
93
+ print(compute_key(Path(args.plan)))
94
+ return 0
95
+
96
+ key = compute_key(Path(args.plan))
97
+
98
+ if args.cmd == "check":
99
+ seen = is_recent(key, args.ttl_hours)
100
+ print(json.dumps({"key": key, "seen_recently": seen,
101
+ "ttl_hours": args.ttl_hours}))
102
+ return 0 if seen else 1 # 0 = seen (skip); 1 = new (proceed)
103
+
104
+ if args.cmd == "record":
105
+ append_ledger({
106
+ "key": key,
107
+ "ts": dt.datetime.now(dt.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
108
+ "daemon": args.daemon,
109
+ "outcome": args.outcome,
110
+ })
111
+ print(f"recorded {key[:12]}…")
112
+ return 0
113
+
114
+ return 2
115
+
116
+
117
+ if __name__ == "__main__":
118
+ sys.exit(main())
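Taken together with the usage block in the docstring, the daemon-side flow is check, then apply, then record. A hedged sketch (paths illustrative; the exit-code contract of 0 = seen recently / 1 = new is the one defined above):

```python
# Dedupe an autonomous action against the idempotency ledger before applying.
import subprocess

IDEM = "bin/v2/idempotency.py"   # adjust to the installed path
plan = "/tmp/plan.json"          # illustrative plan file

# exit 0 = seen within TTL (skip), exit 1 = new (proceed)
seen = subprocess.run(["python3", IDEM, "check", "--plan", plan,
                       "--ttl-hours", "4"]).returncode == 0
if seen:
    print("duplicate plan within TTL, skipping apply")
else:
    # ... apply the verified patch here ...
    subprocess.run(["python3", IDEM, "record", "--plan", plan,
                    "--daemon", "sre", "--outcome", "applied"], check=True)
```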
bin/v2/outcome-log.py ADDED
@@ -0,0 +1,98 @@
1
+ #!/usr/bin/env python3
2
+ """Surrogate-1 β€” outcome logger.
3
+
4
+ All autonomous daemons (autonomous-sre, autonomous-release) call this
5
+ after every action to append a structured record to the outcomes log.
6
+ self-improve.sh reads that log to build the next training round's
7
+ preference + SFT data.
8
+
9
+ One JSONL record per action:
10
+ {
11
+ "ts": "2026-05-01T12:34:56Z",
12
+ "daemon": "sre" | "release",
13
+ "trigger": "...probe_name...",
14
+ "anomaly": {...probe details...},
15
+ "prompt": "<full prompt sent to Surrogate>",
16
+ "response": {...Surrogate's parsed JSON output...},
17
+ "verdict": {...verifier-ensemble JSON...},
18
+ "applied": true|false,
19
+ "outcome": "success" | "rollback" | "queued" | "rejected",
20
+ "metric_after": {...optional post-action observation...},
21
+ "lesson": "optional one-line takeaway"
22
+ }
23
+
24
+ Usage:
25
+ outcome-log.py --daemon sre --trigger hf_space_stage_failed \
26
+ --anomaly /tmp/anomaly.json \
27
+ --prompt /tmp/prompt.md \
28
+ --response /tmp/response.json \
29
+ --verdict /tmp/verdict.json \
30
+ --applied true --outcome success \
31
+ [--lesson "factory_reboot fixed stuck Space"]
32
+ """
33
+ from __future__ import annotations
34
+
35
+ import argparse
36
+ import datetime as dt
37
+ import json
38
+ import os
39
+ import sys
40
+ from pathlib import Path
41
+
42
+ LOG_PATH = Path(os.environ.get(
43
+ "SURROGATE_OUTCOME_LOG",
44
+ str(Path.home() / ".surrogate/state/outcomes.jsonl")))
45
+
46
+
47
+ def _maybe_load(p: str | None) -> object | None:
48
+ if not p:
49
+ return None
50
+ pp = Path(p)
51
+ if not pp.exists():
52
+ return p # treat as inline string
53
+ txt = pp.read_text()
54
+ try:
55
+ return json.loads(txt)
56
+ except Exception:
57
+ return txt # not JSON β†’ store raw
58
+
59
+
60
+ def main() -> int:
61
+ p = argparse.ArgumentParser()
62
+ p.add_argument("--daemon", required=True, choices=["sre", "release", "manual"])
63
+ p.add_argument("--trigger", required=True)
64
+ p.add_argument("--anomaly", default=None,
65
+ help="path to JSON file or inline string")
66
+ p.add_argument("--prompt", default=None)
67
+ p.add_argument("--response", default=None)
68
+ p.add_argument("--verdict", default=None)
69
+ p.add_argument("--applied", choices=["true", "false"], required=True)
70
+ p.add_argument("--outcome", required=True,
71
+ choices=["success", "rollback", "queued", "rejected", "error"])
72
+ p.add_argument("--metric-after", default=None)
73
+ p.add_argument("--lesson", default=None)
74
+ args = p.parse_args()
75
+
76
+ rec = {
77
+ "ts": dt.datetime.now(dt.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
78
+ "daemon": args.daemon,
79
+ "trigger": args.trigger,
80
+ "anomaly": _maybe_load(args.anomaly),
81
+ "prompt": _maybe_load(args.prompt),
82
+ "response": _maybe_load(args.response),
83
+ "verdict": _maybe_load(args.verdict),
84
+ "applied": args.applied == "true",
85
+ "outcome": args.outcome,
86
+ "metric_after": _maybe_load(args.metric_after),
87
+ "lesson": args.lesson,
88
+ }
89
+ LOG_PATH.parent.mkdir(parents=True, exist_ok=True)
90
+ with LOG_PATH.open("a") as f:
91
+ f.write(json.dumps(rec, ensure_ascii=False) + "\n")
92
+ print(f"logged outcome: {args.daemon}/{args.trigger} β†’ {args.outcome} "
93
+ f"(applied={args.applied})", file=sys.stderr)
94
+ return 0
95
+
96
+
97
+ if __name__ == "__main__":
98
+ sys.exit(main())
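Because the log is append-only JSONL, downstream consumers can stream it line by line. A hedged sketch computing the applied-action success rate per daemon (the default path mirrors LOG_PATH above):

```python
# Per-daemon success rate over applied actions in outcomes.jsonl.
import collections
import json
import os
from pathlib import Path

LOG = Path(os.environ.get("SURROGATE_OUTCOME_LOG",
                          str(Path.home() / ".surrogate/state/outcomes.jsonl")))
applied = collections.Counter()
ok = collections.Counter()
lines = LOG.read_text().splitlines() if LOG.exists() else []
for line in lines:
    try:
        r = json.loads(line)
    except ValueError:
        continue
    if r.get("applied"):
        d = r.get("daemon", "?")
        applied[d] += 1
        ok[d] += r.get("outcome") == "success"   # bool counts as 0/1
for d, n in applied.items():
    print(f"{d}: {ok[d]}/{n} applied actions succeeded")
```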
bin/v2/self-improve.sh ADDED
@@ -0,0 +1,283 @@
1
+ #!/usr/bin/env bash
2
+ # Surrogate-1 β€” self-improvement data flywheel.
3
+ #
4
+ # Reads outcomes.jsonl (produced by autonomous-sre.sh + autonomous-release.sh)
5
+ # and converts the success/failure signal into training data for the next
6
+ # round, then triggers a refresh when accumulation crosses thresholds.
7
+ #
8
+ # Pipeline (from research Β§self-improvement.md, cron cadence aligned to
9
+ # research recommendations):
10
+ # 1. Aggregate outcomes since last run.
11
+ # 2. Split into:
12
+ # SUCCESS = applied && outcome=success
13
+ # FAIL = applied && outcome in (rollback, error)
14
+ # REJECTED = !applied β€” verifier blocked it
15
+ # 3. Build 3 datasets:
16
+ # a) SFT replay β€” SUCCESS only, formatted as prompt/response pairs
17
+ # (RLEF-style: model wrote it, executor approved)
18
+ # b) KTO unpaired β€” every outcome with binary "thumbs" label
19
+ # (KTO doesn't need pairs β€” lossless on logs)
20
+ # c) Skill library β€” verified procedures from SUCCESS, indexed by topic
21
+ # 4. Push to HF Hub:
22
+ # axentx/surrogate-1-self-traces (SFT)
23
+ # axentx/surrogate-1-pref-kto (KTO)
24
+ # axentx/surrogate-1-skills (skill library)
25
+ # 5. If SFT pairs β‰₯ SFT_TRIGGER_N OR KTO β‰₯ KTO_TRIGGER_N β†’ kick training:
26
+ # - Bumps Kaggle kernel version (or notifies user to upload)
27
+ # - Logs decision to outcomes.jsonl with daemon=manual trigger=self-improve
28
+ #
29
+ # Cadence (from research):
30
+ # - SFT replay weekly Sun 5am (cheap)
31
+ # - KTO refresh biweekly (1st + 15th)
32
+ # - Skill index daily 4am (free)
33
+ # - Trigger train when thresholds met
34
+ #
35
+ # Usage:
36
+ # bash bin/v2/self-improve.sh # run all stages, idempotent
37
+ # bash bin/v2/self-improve.sh sft # just SFT replay
38
+ # bash bin/v2/self-improve.sh kto # just KTO build
39
+ # bash bin/v2/self-improve.sh skills # just skill library
40
+ # bash bin/v2/self-improve.sh status # report counts only
41
+ set -uo pipefail
42
+ [[ -f "$HOME/.hermes/.env" ]] && { set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a; }
43
+
44
+ HFB="$HOME/.surrogate/hf-space/bin/v2"
45
+ STATE="$HOME/.surrogate/state"
46
+ OUTCOMES="$STATE/outcomes.jsonl"
47
+ WORK="$STATE/self-improve"
48
+ LOG="$HOME/.surrogate/logs/self-improve.log"
49
+ mkdir -p "$WORK" "$(dirname "$LOG")"
50
+
51
+ CMD="${1:-all}"
52
+
53
+ # Trigger thresholds β€” research recommends weekly SFT @ ~$14 H200 cost
54
+ SFT_TRIGGER_N="${SI_SFT_TRIGGER_N:-200}"
55
+ KTO_TRIGGER_N="${SI_KTO_TRIGGER_N:-500}"
56
+
57
+ # HF Hub repos for the three artifact streams
58
+ SFT_REPO="${SI_SFT_REPO:-axentx/surrogate-1-self-traces}"
59
+ KTO_REPO="${SI_KTO_REPO:-axentx/surrogate-1-pref-kto}"
60
+ SKILL_REPO="${SI_SKILL_REPO:-axentx/surrogate-1-skills}"
61
+
62
+ log() { echo "[$(date '+%Y-%m-%dT%H:%M:%S')] $*" | tee -a "$LOG"; }
63
+ notify() {
64
+ [[ -z "${DISCORD_WEBHOOK:-}" ]] && return
65
+ curl -s -X POST -H "Content-Type: application/json" \
66
+ -d "{\"content\":\"♻️ self-improve: $1\"}" \
67
+ "$DISCORD_WEBHOOK" >/dev/null 2>&1 || true
68
+ }
69
+
70
+ # ── Stage: status report ────────────────────────────────────────────────────
71
+ status() {
72
+ if [[ ! -f "$OUTCOMES" ]]; then
73
+ log "no outcomes.jsonl yet β€” daemons haven't logged anything"
74
+ return 0
75
+ fi
76
+ python3 - <<PYEOF
77
+ import json, collections
78
+ from pathlib import Path
79
+ n = collections.Counter()
80
+ by_daemon = collections.Counter()
81
+ trigger = collections.Counter()
82
+ for L in Path("$OUTCOMES").read_text().splitlines():
83
+ try: r = json.loads(L)
84
+ except Exception: continue
85
+ n[r.get("outcome","?")] += 1
86
+ by_daemon[r.get("daemon","?")] += 1
87
+ trigger[r.get("trigger","?")] += 1
88
+ print(f" total outcomes: {sum(n.values())}")
89
+ print(f" by outcome: {dict(n)}")
90
+ print(f" by daemon: {dict(by_daemon)}")
91
+ print(f" top triggers:")
92
+ for t, c in trigger.most_common(8):
93
+ print(f" {c:4d} {t}")
94
+ PYEOF
95
+ }
96
+
97
+ # ── Stage: SFT replay (RLEF-aligned) ────────────────────────────────────────
98
+ build_sft() {
99
+ log "── SFT replay build ──"
100
+ [[ ! -f "$OUTCOMES" ]] && { log " no outcomes file β€” skip"; return 0; }
101
+ python3 - <<'PYEOF' "$OUTCOMES" "$WORK/sft.jsonl"
102
+ import json, sys
103
+ from pathlib import Path
104
+ src, dst = sys.argv[1], sys.argv[2]
105
+ n_in = n_out = 0
106
+ with open(dst, "w") as out:
107
+ for L in Path(src).read_text().splitlines():
108
+ n_in += 1
109
+ try: r = json.loads(L)
110
+ except Exception: continue
111
+ if not r.get("applied"): continue
112
+ if r.get("outcome") != "success": continue
113
+ # The model's diagnosis/spec/patch IS the response. The trigger +
114
+ # anomaly together form the prompt.
115
+ prompt = (
116
+ f"You are Surrogate-1 in {r.get('daemon','?')} mode.\n"
117
+ f"Trigger: {r.get('trigger','?')}\n"
118
+ f"Anomaly:\n```json\n{json.dumps(r.get('anomaly'), indent=2)}\n```\n"
119
+ f"Output a JSON action with diagnosis + patch."
120
+ )
121
+ resp = r.get("response")
122
+ if not isinstance(resp, dict): continue
123
+ out.write(json.dumps({
124
+ "prompt": prompt,
125
+ "response": json.dumps(resp, indent=2),
126
+ "source": "self-trace",
127
+ "ts": r.get("ts"),
128
+ "trigger": r.get("trigger"),
129
+ "lesson": r.get("lesson"),
130
+ }, ensure_ascii=False) + "\n")
131
+ n_out += 1
132
+ print(f" SFT pairs: {n_out} (read {n_in})")
133
+ PYEOF
134
+ local n; n=$(wc -l < "$WORK/sft.jsonl" | tr -d ' ')
135
+ log " β†’ $WORK/sft.jsonl ($n pairs)"
136
+ if (( n >= SFT_TRIGGER_N )); then
137
+ log " threshold met ($n β‰₯ $SFT_TRIGGER_N) β€” pushing + flagging trigger"
138
+ push_dataset "$SFT_REPO" "$WORK/sft.jsonl"
139
+ trigger_next_round "sft" "$n"
140
+ else
141
+ log " below trigger ($n < $SFT_TRIGGER_N) β€” accumulating"
142
+ fi
143
+ }
144
+
145
+ # ── Stage: KTO unpaired preferences ─────────────────────────────────────────
146
+ build_kto() {
147
+ log "── KTO unpaired build ──"
148
+ [[ ! -f "$OUTCOMES" ]] && { log " no outcomes file β€” skip"; return 0; }
149
+ python3 - <<'PYEOF' "$OUTCOMES" "$WORK/kto.jsonl"
150
+ import json, sys
151
+ from pathlib import Path
152
+ src, dst = sys.argv[1], sys.argv[2]
153
+ n = 0
154
+ with open(dst, "w") as out:
155
+ for L in Path(src).read_text().splitlines():
156
+ try: r = json.loads(L)
157
+ except Exception: continue
158
+ oc = r.get("outcome")
159
+ if oc not in ("success","rollback","error","queued","rejected"): continue
160
+ # KTO label: True = applied & success, False = anything else
161
+ label = bool(r.get("applied")) and (oc == "success")
162
+ prompt = (
163
+ f"trigger={r.get('trigger','?')} daemon={r.get('daemon','?')}\n"
164
+ f"anomaly={json.dumps(r.get('anomaly'))[:400]}"
165
+ )
166
+ resp = r.get("response")
167
+ if not isinstance(resp, dict): continue
168
+ out.write(json.dumps({
169
+ "prompt": prompt,
170
+ "completion": json.dumps(resp)[:2000],
171
+ "label": label,
172
+ "ts": r.get("ts"),
173
+ }, ensure_ascii=False) + "\n")
174
+ n += 1
175
+ print(f" KTO rows: {n}")
176
+ PYEOF
177
+ local n; n=$(wc -l < "$WORK/kto.jsonl" | tr -d ' ')
178
+ log " β†’ $WORK/kto.jsonl ($n rows)"
179
+ if (( n >= KTO_TRIGGER_N )); then
180
+ log " threshold met β€” pushing"
181
+ push_dataset "$KTO_REPO" "$WORK/kto.jsonl"
182
+ trigger_next_round "kto" "$n"
183
+ fi
184
+ }
185
+
186
+ # ── Stage: skill library ────────────────────────────────────────────────────
187
+ build_skills() {
188
+ log "── skill library build ──"
189
+ [[ ! -f "$OUTCOMES" ]] && { log " no outcomes file β€” skip"; return 0; }
190
+ python3 - <<'PYEOF' "$OUTCOMES" "$WORK/skills.jsonl"
191
+ import json, sys, collections
192
+ from pathlib import Path
193
+ src, dst = sys.argv[1], sys.argv[2]
194
+ # Group successful patches by trigger keyword to form a skill = (keyword, top-N successful patches)
195
+ groups = collections.defaultdict(list)
196
+ for L in Path(src).read_text().splitlines():
197
+ try: r = json.loads(L)
198
+ except Exception: continue
199
+ if not (r.get("applied") and r.get("outcome") == "success"): continue
200
+ resp = r.get("response")
201
+ if not isinstance(resp, dict): continue
202
+ trig = r.get("trigger","misc").split(":")[0]
203
+ groups[trig].append({
204
+ "patch": resp.get("patch",""),
205
+ "rollback": resp.get("rollback",""),
206
+ "test_plan": resp.get("test_plan",""),
207
+ "ts": r.get("ts"),
208
+ })
209
+ n = 0
210
+ with open(dst, "w") as out:
211
+ for trig, items in groups.items():
212
+ items.sort(key=lambda x: x.get("ts",""), reverse=True)
213
+ out.write(json.dumps({
214
+ "skill": trig,
215
+ "n_examples": len(items),
216
+ "examples": items[:5], # keep top 5 most-recent
217
+ }, ensure_ascii=False) + "\n")
218
+ n += 1
219
+ print(f" skills: {n}")
220
+ PYEOF
221
+ local n; n=$(wc -l < "$WORK/skills.jsonl" | tr -d ' ')
222
+ log " β†’ $WORK/skills.jsonl ($n skills)"
223
+ if (( n > 0 )); then
224
+ push_dataset "$SKILL_REPO" "$WORK/skills.jsonl"
225
+ fi
226
+ }
227
+
228
+ # ── Push to HF Hub via huggingface_hub Python API ───────────────────────────
229
+ push_dataset() {
230
+ local repo="$1" path="$2"
231
+ if [[ -z "${HF_TOKEN:-}" ]]; then
232
+ log " HF_TOKEN missing β€” saving locally only"
233
+ return 0
234
+ fi
235
+ python3 - <<PYEOF
236
+ import os
237
+ from huggingface_hub import HfApi, create_repo
238
+ api = HfApi(token=os.environ["HF_TOKEN"])
239
+ try:
240
+ create_repo("$repo", repo_type="dataset", exist_ok=True, private=False)
241
+ except Exception as e:
242
+ print(f" create_repo: {type(e).__name__}: {e}")
243
+ api.upload_file(
244
+ path_or_fileobj="$path",
245
+ path_in_repo="$(basename "$path")",
246
+ repo_id="$repo",
247
+ repo_type="dataset",
248
+ commit_message="self-improve: $(basename "$path") $(date -u +%Y%m%dT%H%MZ)",
249
+ )
250
+ print(f" pushed β†’ https://huggingface.co/datasets/$repo")
251
+ PYEOF
252
+ }
253
+
254
+ # ── Trigger next training round ─────────────────────────────────────────────
255
+ trigger_next_round() {
256
+ local stage="$1" n="$2"
257
+ log " TRIGGER next training round (stage=$stage n=$n)"
258
+ notify "$stage threshold reached ($n) β€” flagging next training round"
259
+ python3 "$HFB/outcome-log.py" --daemon manual --trigger "self-improve-trigger-$stage" \
260
+ --applied false --outcome queued \
261
+ --lesson "$stage threshold reached ($n) β€” V8 training queued" || true
262
+ # If Kaggle CLI ever returns to a working state, this is where we'd
263
+ # call `kaggle kernels push`. For now, write a flag file the user
264
+ # checks manually.
265
+ echo "$(date -u +%Y%m%dT%H%MZ) $stage n=$n" >> "$STATE/training-queue.log"
266
+ }
267
+
268
+ # ── Dispatcher ──────────────────────────────────────────────────────────────
269
+ case "$CMD" in
270
+ status) status ;;
271
+ sft) build_sft ;;
272
+ kto) build_kto ;;
273
+ skills) build_skills ;;
274
+ all)
275
+ status
276
+ build_skills
277
+ build_sft
278
+ build_kto
279
+ ;;
280
+ *) echo "usage: $0 {all|sft|kto|skills|status}" >&2; exit 2 ;;
281
+ esac
282
+
283
+ log "done"
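On the training side, kto.jsonl already matches the unpaired prompt/completion/label schema that TRL's KTO implementation expects. A hedged sketch, assuming a placeholder base model and a TRL version that ships KTOTrainer (older releases take tokenizer= instead of processing_class=; check your installed version's signature):

```python
# Feed the self-improve KTO rows into TRL's unpaired-preference trainer.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import KTOConfig, KTOTrainer

model_id = "axentx/surrogate-1-7B-base"   # placeholder model id
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# kto.jsonl rows: {"prompt": ..., "completion": ..., "label": true|false}
ds = load_dataset("json", data_files="kto.jsonl", split="train")

trainer = KTOTrainer(
    model=model,
    args=KTOConfig(output_dir="kto-out", per_device_train_batch_size=2),
    train_dataset=ds,
    processing_class=tokenizer,   # tokenizer= on older TRL versions
)
trainer.train()
```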
bin/v2/surrogate-call.py ADDED
@@ -0,0 +1,177 @@
1
+ #!/usr/bin/env python3
2
+ """Surrogate-1 β€” single-shot call to the ZeroGPU endpoint with strict JSON parse.
3
+
4
+ Used by autonomous-sre.sh + autonomous-release.sh to ask Surrogate-1 for
5
+ a structured diagnosis/spec/patch. Returns parsed JSON on stdout, exits 0
6
+ if the response is valid + matches the schema, else non-zero.
7
+
8
+ Usage:
9
+ surrogate-call.py \
10
+ --space surrogate1/surrogate-1-zero-gpu \
11
+ --prompt-file /tmp/prompt.md \
12
+ --schema diagnosis|spec|patch \
13
+ [--max-tokens 1024] [--temperature 0.2] \
14
+ [--retries 2] [--out /tmp/response.json]
15
+
16
+ Env:
17
+ HF_TOKEN (or HF_TOKEN_PRO / HF_TOKEN_PRO_WRITE) — required for private/queued Space
18
+ SURROGATE_TIMEOUT_SEC=120 β€” per-call timeout
19
+ SURROGATE_RETRY_BACKOFF_SEC=15 β€” sleep between retries
20
+ """
21
+ from __future__ import annotations
22
+
23
+ import argparse
24
+ import json
25
+ import os
26
+ import re
27
+ import sys
28
+ import time
29
+ from pathlib import Path
30
+ from urllib import request, error
31
+
32
+ TIMEOUT = int(os.environ.get("SURROGATE_TIMEOUT_SEC", "120"))
33
+ BACKOFF = int(os.environ.get("SURROGATE_RETRY_BACKOFF_SEC", "15"))
34
+
35
+ SCHEMAS = {
36
+ "diagnosis": {
37
+ "required": ["diagnosis", "fix_kind", "confidence"],
38
+ "fix_kind_enum": ["code", "iac", "shell", "sql", "none"],
39
+ "extras": ["patch", "target_file", "rollback", "test_plan"],
40
+ },
41
+ "spec": {
42
+ "required": ["title", "problem", "user_stories",
43
+ "acceptance_criteria", "impact", "confidence"],
44
+ "extras": ["competitors_observed", "out_of_scope", "rollout_plan"],
45
+ },
46
+ "patch": {
47
+ "required": ["target_file", "patch", "kind",
48
+ "test_plan", "rollback", "confidence"],
49
+ "extras": ["fix_kind", "diagnosis"],
50
+ },
51
+ }
52
+
53
+
54
+ def _hf_token() -> str | None:
55
+ return (os.environ.get("HF_TOKEN")
56
+ or os.environ.get("HF_TOKEN_PRO")
57
+ or os.environ.get("HF_TOKEN_PRO_WRITE"))
58
+
59
+
60
+ def _post_json(url: str, body: dict, token: str | None) -> dict:
61
+ headers = {"Content-Type": "application/json"}
62
+ if token:
63
+ headers["Authorization"] = f"Bearer {token}"
64
+ req = request.Request(url, data=json.dumps(body).encode(),
65
+ headers=headers, method="POST")
66
+ with request.urlopen(req, timeout=TIMEOUT) as resp:
67
+ return json.loads(resp.read().decode())
68
+
69
+
70
+ def _extract_json(text: str) -> dict | None:
71
+ # Try fenced ```json … ``` first, then loose {...} sweep
72
+ m = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", text, flags=re.S)
73
+ candidates = [m.group(1)] if m else []
74
+ # also try the longest balanced {..} substring
75
+ depth = 0; start = -1; longest = ""
76
+ for i, ch in enumerate(text):
77
+ if ch == "{":
78
+ if depth == 0:
79
+ start = i
80
+ depth += 1
81
+ elif ch == "}":
82
+ depth -= 1
83
+ if depth == 0 and start >= 0:
84
+ blob = text[start:i + 1]
85
+ if len(blob) > len(longest):
86
+ longest = blob
87
+ if longest:
88
+ candidates.append(longest)
89
+ for c in candidates:
90
+ try:
91
+ return json.loads(c)
92
+ except Exception:
93
+ continue
94
+ return None
95
+
96
+
97
+ def _validate(parsed: dict, schema: str) -> tuple[bool, str]:
98
+ spec = SCHEMAS.get(schema)
99
+ if not spec:
100
+ return False, f"unknown schema: {schema}"
101
+ missing = [k for k in spec["required"] if k not in parsed]
102
+ if missing:
103
+ return False, f"missing required keys: {missing}"
104
+ if schema == "diagnosis":
105
+ if parsed.get("fix_kind") not in spec["fix_kind_enum"]:
106
+ return False, f"fix_kind must be one of {spec['fix_kind_enum']}"
107
+ try:
108
+ c = float(parsed.get("confidence", -1))
109
+ if not (0.0 <= c <= 1.0):
110
+ return False, f"confidence out of [0,1]: {c}"
111
+ except Exception:
112
+ return False, "confidence not numeric"
113
+ return True, "ok"
114
+
115
+
116
+ def _call_gradio(space: str, prompt: str, max_tokens: int,
117
+ temperature: float) -> str:
118
+ # Most Surrogate ZeroGPU Spaces expose /run/predict or /api/predict.
119
+ # Try modern /api/predict first, fall back to /run/predict.
120
+ base = f"https://{space.replace('/', '-')}.hf.space"
121
+ body = {"data": [prompt, "", max_tokens, temperature]}
122
+ for path in ("/api/predict", "/run/predict"):
123
+ try:
124
+ r = _post_json(base + path, body, _hf_token())
125
+ if isinstance(r, dict) and "data" in r and r["data"]:
126
+ first = r["data"][0]
127
+ if isinstance(first, str):
128
+ return first
129
+ if isinstance(first, list) and first:
130
+ return str(first[0])
131
+ return json.dumps(r)
132
+ except error.HTTPError as e:
133
+ if e.code in (404, 405):
134
+ continue
135
+ raise
136
+ raise RuntimeError(f"no working endpoint on {base}")
137
+
138
+
139
+ def main() -> int:
140
+ p = argparse.ArgumentParser()
141
+ p.add_argument("--space", required=True,
142
+ help="HF Space repo, e.g. surrogate1/surrogate-1-zero-gpu")
143
+ p.add_argument("--prompt-file", required=True)
144
+ p.add_argument("--schema", required=True, choices=list(SCHEMAS.keys()))
145
+ p.add_argument("--max-tokens", type=int, default=1024)
146
+ p.add_argument("--temperature", type=float, default=0.2)
147
+ p.add_argument("--retries", type=int, default=2)
148
+ p.add_argument("--out", default=None)
149
+ args = p.parse_args()
150
+
151
+ prompt = Path(args.prompt_file).read_text()
152
+ last_err = ""
153
+ for attempt in range(args.retries + 1):
154
+ try:
155
+ raw = _call_gradio(args.space, prompt, args.max_tokens, args.temperature)
156
+ parsed = _extract_json(raw)
157
+ if parsed is None:
158
+ last_err = f"no JSON in response (preview: {raw[:200]})"
159
+ else:
160
+ ok, msg = _validate(parsed, args.schema)
161
+ if ok:
162
+ out = json.dumps(parsed, indent=2)
163
+ print(out)
164
+ if args.out:
165
+ Path(args.out).write_text(out)
166
+ return 0
167
+ last_err = f"schema validation failed: {msg}"
168
+ except Exception as e:
169
+ last_err = f"{type(e).__name__}: {e}"
170
+ if attempt < args.retries:
171
+ time.sleep(BACKOFF * (attempt + 1))
172
+ sys.stderr.write(f"surrogate-call failed: {last_err}\n")
173
+ return 2
174
+
175
+
176
+ if __name__ == "__main__":
177
+ sys.exit(main())
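A hedged sketch of driving this helper from Python rather than bash: on exit 0 stdout is the schema-validated JSON, so it can be parsed directly (Space name from the docstring; prompt path illustrative):

```python
# Call surrogate-call.py and consume the validated diagnosis JSON.
import json
import subprocess

proc = subprocess.run(
    ["python3", "bin/v2/surrogate-call.py",
     "--space", "surrogate1/surrogate-1-zero-gpu",
     "--prompt-file", "/tmp/prompt.md",      # illustrative prompt path
     "--schema", "diagnosis"],
    capture_output=True, text=True)
if proc.returncode == 0:
    diag = json.loads(proc.stdout)           # already schema-validated
    print(diag["fix_kind"], diag["confidence"])
else:
    print("call failed:", proc.stderr.strip())
```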
bin/v2/verifier-ensemble.py ADDED
@@ -0,0 +1,404 @@
1
+ #!/usr/bin/env python3
2
+ """Surrogate-1 β€” verifier ensemble (single source of truth for "safe to apply").
3
+
4
+ Used by autonomous-sre.sh + autonomous-release.sh BEFORE any action is
5
+ applied to the user's real systems. Returns a JSON verdict; the caller
6
+ applies only if verdict.ok == True.
7
+
8
+ Layers (each returns PASS / FAIL / SKIP):
9
+ 1. ast — Python AST (.py) / node --check (.js,.ts) / bash -n (.sh) / yaml+json parse (iac)
10
+ 2. lint — ruff (.py) / shellcheck (.sh) / hadolint (Dockerfile)
11
+ / cfn-lint (CF) / tflint (TF)
12
+ 3. typecheck β€” mypy / tsc / pyright if config present
13
+ 4. tests — pytest on the repo's tests/ dir if present
14
+ 5. policy β€” refuse-list of destructive patterns (rm -rf /, DROP DATABASE,
15
+ iam:* on Resource: "*", DELETE FROM <table> without WHERE…)
16
+ 6. security — semgrep --config=p/ci, gitleaks for secrets, cfn-guard if CFN rules set
17
+ 7. diff — non-empty, scoped: ≤ MAX_LINES_CHANGED lines, ≤ 8 files
18
+ 8. sandbox — shell scripts exec in throwaway docker (--network=none), bash -n fallback
19
+ 9. confidence β€” caller passes model logprob; threshold check
20
+
21
+ DECISION:
22
+ ALL non-SKIP must be PASS, AND at least MIN_VERIFIERS_RUN actually executed.
23
+ Any FAIL β†’ ok=False with reasons.
24
+
25
+ Usage:
26
+ verifier-ensemble.py \
27
+ --change /path/to/patch.diff \
28
+ --target /path/to/file/being/changed \
29
+ --kind iac|code|sql|shell \
30
+ --confidence 0.92 \
31
+ --out /tmp/verdict.json
32
+ """
33
+ from __future__ import annotations
34
+
35
+ import argparse
36
+ import json
37
+ import os
38
+ import re
39
+ import shlex
40
+ import subprocess
41
+ import sys
42
+ import tempfile
43
+ from dataclasses import dataclass, field, asdict
44
+ from pathlib import Path
45
+
46
+ MIN_VERIFIERS_RUN = int(os.environ.get("VERIFIER_MIN_RUN", "3"))
47
+ MAX_LINES_CHANGED = int(os.environ.get("VERIFIER_MAX_LINES", "300"))
48
+ CONFIDENCE_FLOOR = float(os.environ.get("VERIFIER_CONFIDENCE_FLOOR", "0.55"))
49
+
50
+ # Hard refuse list β€” patterns that auto-FAIL regardless of other checks.
51
+ # Each entry: (regex, reason). Sourced from research (autonomous-24x7.md
52
+ # Β§HardGuards) β€” 14+ canonical rules. NEVER auto-override these in code.
53
+ REFUSE_PATTERNS = [
54
+ # 1. Filesystem destruction
55
+ (r"\brm\s+-rf\s+/(?!tmp|var/tmp|home/[^/]+/\.surrogate)", "rm -rf on real fs root"),
56
+ (r"\bchmod\s+-R\s+777\s+/(?!tmp)", "chmod 777 outside /tmp"),
57
+ (r"\bchown\s+-R\s+\S+\s+/(?!tmp|home/)", "chown -R on system path"),
58
+ # 2. Database destruction
59
+ (r"\bDROP\s+(DATABASE|TABLE|SCHEMA)\b", "destructive SQL DDL"),
60
+ (r"\bDELETE\s+FROM\b(?![^;]*\bWHERE\b)", "DELETE without WHERE"),
61
+ (r"\bTRUNCATE\s+TABLE\b", "TRUNCATE TABLE"),
62
+ # 3. IaC destructive ops on prod
63
+ (r"\bterraform\s+destroy\b", "terraform destroy"),
64
+ (r"\bterraform\s+(apply|plan).*\bworkspace.*\bprd\b", "terraform on prd workspace"),
65
+ (r"\bcdk\s+destroy\b.*\b(prd|prod)\b", "cdk destroy on prod"),
66
+ # 4. Cloud destructive ops
67
+ (r"\baws\s+s3\s+rb\s+--force\b", "aws s3 rb --force"),
68
+ (r"\baws\s+ec2\s+terminate-instances\b(?!.*--dry-run)", "ec2 terminate w/o dry-run"),
69
+ (r"\baws\s+rds\s+delete-db-instance\b(?!.*--final-db-snapshot-identifier)",
70
+ "rds delete w/o final snapshot"),
71
+ (r"\baws\s+route53\s+change-resource-record-sets\b.*\bDELETE\b", "Route53 DELETE record"),
72
+ # 5. Kubernetes destructive ops
73
+ (r"\bkubectl\s+delete\s+ns\b", "kubectl delete namespace"),
74
+ (r"\bkubectl\s+delete\s+\S+\s+\S*prod\S*\b", "kubectl delete *prod*"),
75
+ (r"\bhelm\s+install\b.*\b(http://|registry\.\S+)\b(?!.*allowlist)",
76
+ "helm install from non-allowlist registry"),
77
+ # 6. Git/source destruction
78
+ (r"\bgit\s+push\s+(--force|--force-with-lease).*\b(main|master|prod)\b",
79
+ "force-push to main/prod"),
80
+ (r"\bgit\s+filter-(branch|repo)\b", "git history rewrite"),
81
+ # 7. IAM / auth weakening
82
+ (r'"Action"\s*:\s*"\*".*"Resource"\s*:\s*"\*"', "IAM Allow * on *"),
83
+ (r'"Effect"\s*:\s*"Allow".*"Principal"\s*:\s*"\*"', "IAM Allow Principal *"),
84
+ (r"\baws\s+iam\s+(delete-user|delete-role|update-assume-role-policy)\b.*\b(admin|root|prod)\b",
85
+ "IAM destructive op on privileged identity"),
86
+ (r"\baws\s+ec2\s+revoke-security-group-(ingress|egress)\b.*\bprod\b",
87
+ "revoke prod SG rule"),
88
+ # 8. Disk / network
89
+ (r"\bdd\s+if=/dev/(zero|random)\s+of=/dev/[shv]d", "raw disk overwrite"),
90
+ (r"\biptables\s+-F\b", "iptables flush"),
91
+ # 9. Untrusted execution
92
+ (r"\b(curl|wget)\b\s+\S+\s*\|\s*(sudo\s+)?(bash|sh|zsh|python\d?)\b",
93
+ "curl | sh from network"),
94
+ (r"\bnpx\s+\S+\b(?!.*--package-lock-only)", "npx of untrusted package"),
95
+ # 10. Secrets in patch (must never land)
96
+ (r"AKIA[0-9A-Z]{16}", "AWS access key in patch"),
97
+ (r"-----BEGIN\s+(RSA|OPENSSH|EC|DSA)\s+PRIVATE\s+KEY-----", "private key in patch"),
98
+ (r"\bsk-[A-Za-z0-9]{32,}", "OpenAI/Anthropic-style API key"),
99
+ (r"\bhf_[A-Za-z0-9]{34}\b", "HuggingFace token in patch"),
100
+ # 11. MFA / security degradation
101
+ (r"\baws\s+iam\s+deactivate-mfa-device\b", "MFA deactivation"),
102
+ (r'"MultiFactorAuthPresent"\s*:\s*\{\s*"Bool"\s*:\s*"false"', "IAM bypass MFA"),
103
+ # 12. Helm / supply-chain risk
104
+ (r"\bdocker\s+pull\s+\S+(?!.*@sha256:)", "docker pull without digest pin"),
105
+ ]
106
+
107
+ # Destructive-class actions require >=0.95 confidence (from research Β§HardGuards)
108
+ DESTRUCTIVE_KEYWORDS = (
109
+ "destroy", "delete", "drop", "truncate", "force-push", "rm -rf",
110
+ "terminate", "revoke", "deactivate-mfa", "filter-branch",
111
+ )
112
+ DESTRUCTIVE_CONFIDENCE_FLOOR = float(
113
+ os.environ.get("VERIFIER_DESTRUCTIVE_FLOOR", "0.95"))
114
+
115
+
116
+ def _is_destructive(change: str) -> bool:
117
+ low = change.lower()
118
+ return any(kw in low for kw in DESTRUCTIVE_KEYWORDS)
119
+
120
+
121
+ @dataclass
122
+ class CheckResult:
123
+ name: str
124
+ status: str # PASS / FAIL / SKIP
125
+ detail: str = ""
126
+
127
+ def passed(self) -> bool:
128
+ return self.status == "PASS"
129
+
130
+ def failed(self) -> bool:
131
+ return self.status == "FAIL"
132
+
133
+
134
+ @dataclass
135
+ class Verdict:
136
+ ok: bool
137
+ reasons: list[str] = field(default_factory=list)
138
+ checks: list[CheckResult] = field(default_factory=list)
139
+ n_pass: int = 0
140
+ n_fail: int = 0
141
+ n_skip: int = 0
142
+
143
+
144
+ def _run(cmd: list[str], timeout: int = 60, cwd: str | None = None) -> tuple[int, str, str]:
145
+ try:
146
+ p = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout, cwd=cwd)
147
+ return p.returncode, p.stdout, p.stderr
148
+ except subprocess.TimeoutExpired:
149
+ return 124, "", "timeout"
150
+ except FileNotFoundError:
151
+ return 127, "", f"binary not found: {cmd[0]}"
152
+
153
+
154
+ def _have(binary: str) -> bool:
155
+ return _run(["which", binary])[0] == 0
156
+
157
+
158
+ # ── Layer 1: AST parse ──────────────────────────────────────────────────────
159
+ def check_ast(target: Path, kind: str) -> CheckResult:
160
+ if not target.exists():
161
+ return CheckResult("ast", "SKIP", "target file does not exist yet")
162
+ if kind == "code" and target.suffix == ".py":
163
+ try:
164
+ import ast
165
+ ast.parse(target.read_text())
166
+ return CheckResult("ast", "PASS", "python AST parses")
167
+ except SyntaxError as e:
168
+ return CheckResult("ast", "FAIL", f"py syntax: {e}")
169
+ if kind == "code" and target.suffix in (".js", ".ts", ".tsx", ".jsx"):
170
+ if _have("node"):
171
+ rc, _, err = _run(["node", "--check", str(target)], timeout=15)
172
+ return CheckResult("ast", "PASS" if rc == 0 else "FAIL", err.strip()[:200] or "ok")
173
+ return CheckResult("ast", "SKIP", "node not installed")
174
+ if kind == "shell" or target.suffix == ".sh":
175
+ rc, _, err = _run(["bash", "-n", str(target)], timeout=15)
176
+ return CheckResult("ast", "PASS" if rc == 0 else "FAIL", err.strip()[:200] or "ok")
177
+ if kind == "iac" and target.suffix in (".yml", ".yaml", ".json"):
178
+ try:
179
+ txt = target.read_text()
180
+ if target.suffix == ".json":
181
+ json.loads(txt)
182
+ else:
183
+ import yaml # type: ignore
184
+ yaml.safe_load(txt)
185
+ return CheckResult("ast", "PASS", "yaml/json parses")
186
+ except Exception as e:
187
+ return CheckResult("ast", "FAIL", f"parse: {e}")
188
+ return CheckResult("ast", "SKIP", f"no AST parser for {target.suffix} (kind={kind})")
189
+
190
+
191
+ # ── Layer 2: lint ───────────────────────────────────────────────────────────
192
+ def check_lint(target: Path, kind: str) -> CheckResult:
193
+ if not target.exists():
194
+ return CheckResult("lint", "SKIP", "no file")
195
+ sx = target.suffix
196
+ if sx == ".py" and _have("ruff"):
197
+ rc, out, _ = _run(["ruff", "check", str(target), "--quiet"], timeout=30)
198
+ return CheckResult("lint", "PASS" if rc == 0 else "FAIL", out.strip()[:300] or "clean")
199
+ if sx == ".sh" and _have("shellcheck"):
200
+ rc, out, _ = _run(["shellcheck", "-S", "warning", str(target)], timeout=30)
201
+ return CheckResult("lint", "PASS" if rc == 0 else "FAIL", out.strip()[:300] or "clean")
202
+ if target.name in ("Dockerfile",) and _have("hadolint"):
203
+ rc, out, _ = _run(["hadolint", "--no-fail", str(target)], timeout=30)
204
+ return CheckResult("lint", "PASS" if rc == 0 else "FAIL", out.strip()[:300])
205
+ if kind == "iac" and "cf" in str(target).lower() and _have("cfn-lint"):
206
+ rc, out, _ = _run(["cfn-lint", str(target)], timeout=60)
207
+ return CheckResult("lint", "PASS" if rc == 0 else "FAIL", out.strip()[:300] or "clean")
208
+ if kind == "iac" and target.suffix == ".tf" and _have("tflint"):
209
+ rc, out, _ = _run(["tflint", str(target)], timeout=60)
210
+ return CheckResult("lint", "PASS" if rc == 0 else "FAIL", out.strip()[:300] or "clean")
211
+ return CheckResult("lint", "SKIP", "no linter for file type or binary missing")
212
+
213
+
214
+ # ── Layer 3: typecheck ──────────────────────────────────────────────────────
215
+ def check_typecheck(target: Path, kind: str) -> CheckResult:
216
+ if not target.exists() or kind != "code":
217
+ return CheckResult("typecheck", "SKIP", "n/a")
218
+ if target.suffix == ".py" and _have("mypy"):
219
+ rc, out, _ = _run(["mypy", "--ignore-missing-imports", "--no-error-summary",
220
+ str(target)], timeout=45)
221
+ return CheckResult("typecheck", "PASS" if rc == 0 else "FAIL", out.strip()[:300] or "ok")
222
+ if target.suffix in (".ts", ".tsx") and _have("tsc"):
223
+ rc, out, _ = _run(["tsc", "--noEmit", "--allowJs", str(target)], timeout=60)
224
+ return CheckResult("typecheck", "PASS" if rc == 0 else "FAIL", out.strip()[:300] or "ok")
225
+ return CheckResult("typecheck", "SKIP", "no typechecker available")
226
+
227
+
228
+ # ── Layer 4: tests ──────────────────────────────────────────────────────────
229
+ def check_tests(target: Path, kind: str) -> CheckResult:
230
+ repo = target.parent
231
+ while repo != repo.parent and not (repo / ".git").exists():
232
+ repo = repo.parent
233
+ if not (repo / ".git").exists():
234
+ return CheckResult("tests", "SKIP", "not a git repo")
235
+ test_dir = next((repo / d for d in ("tests", "test", "__tests__") if (repo / d).is_dir()), None)
236
+ if test_dir is None:
237
+ return CheckResult("tests", "SKIP", "no tests/ dir")
238
+ if _have("pytest"):
239
+ rc, out, _ = _run(["pytest", "-x", "--tb=line", "-q", str(test_dir)],
240
+ timeout=180, cwd=str(repo))
241
+ return CheckResult("tests", "PASS" if rc == 0 else "FAIL",
242
+ (out.strip().splitlines() or ["no output"])[-1][:200])
243
+ return CheckResult("tests", "SKIP", "pytest not installed")
244
+
245
+
246
+ # ── Layer 5: policy (refuse-list) ───────────────────────────────────────────
247
+ def check_policy(change: str) -> CheckResult:
248
+ hits = []
249
+ for pat, reason in REFUSE_PATTERNS:
250
+ if re.search(pat, change, flags=re.IGNORECASE):
251
+ hits.append(reason)
252
+ if hits:
253
+ return CheckResult("policy", "FAIL", f"refused: {'; '.join(hits)}")
254
+ return CheckResult("policy", "PASS", "no refuse-list patterns matched")
255
+
256
+
257
+ # ── Layer 6: security ───────────────────────────────────────────────────────
258
+ def check_security(target: Path, change: str) -> CheckResult:
259
+ detail = []
260
+ # secrets β€” gitleaks if available, else regex fallback
261
+ if _have("gitleaks"):
262
+ with tempfile.NamedTemporaryFile("w", suffix=".diff", delete=False) as f:
263
+ f.write(change); patch = f.name
264
+ rc, out, _ = _run(["gitleaks", "detect", "--no-git", "--source", patch,
265
+ "--report-format", "json"], timeout=30)
266
+ if rc != 0 and out.strip() and out.strip() != "[]":
267
+ detail.append(f"gitleaks hit: {out[:200]}")
268
+ else:
269
+ for pat in (r"AKIA[0-9A-Z]{16}", r"AIza[0-9A-Za-z\-_]{35}",
270
+ r"sk-[a-zA-Z0-9]{32,}", r"hf_[a-zA-Z0-9]{34}"):
271
+ if re.search(pat, change):
272
+ detail.append(f"secret pattern: {pat[:20]}…")
273
+ # semgrep
274
+ if _have("semgrep") and target.exists():
275
+ rc, out, _ = _run(["semgrep", "--config=p/ci", "--quiet", "--error",
276
+ "--timeout", "30", str(target)], timeout=90)
277
+ if rc not in (0, 1): # 1 = findings (recorded as failures below); >1 = tool error
278
+ detail.append(f"semgrep err: {out[:120]}")
279
+ elif rc == 1:
280
+ detail.append(f"semgrep findings: {out.strip().splitlines()[-1][:150]}")
281
+ # iac scanners
282
+ if "cf" in str(target).lower() and _have("cfn-guard"):
283
+ rules = os.environ.get("CFN_GUARD_RULES", "")
284
+ if rules:
285
+ rc, out, _ = _run(["cfn-guard", "validate", "-d", str(target), "-r", rules],
286
+ timeout=60)
287
+ if rc != 0:
288
+ detail.append(f"cfn-guard: {out[:200]}")
289
+ if not detail:
290
+ return CheckResult("security", "PASS", "no findings")
291
+ return CheckResult("security", "FAIL", " | ".join(detail))
292
+
293
+
294
+ # ── Layer 7: diff sanity ────────────────────────────────────────────────────
295
+ def check_diff(change: str) -> CheckResult:
296
+ lines = change.splitlines()
297
+ add = sum(1 for L in lines if L.startswith("+") and not L.startswith("+++"))
298
+ rem = sum(1 for L in lines if L.startswith("-") and not L.startswith("---"))
299
+ total = add + rem
300
+ if total == 0:
301
+ return CheckResult("diff", "FAIL", "empty diff")
302
+ if total > MAX_LINES_CHANGED:
303
+ return CheckResult("diff", "FAIL",
304
+ f"{total} lines changed > limit {MAX_LINES_CHANGED}")
305
+ files_changed = sum(1 for L in lines if L.startswith("+++ b/"))
306
+ if files_changed > 8:
307
+ return CheckResult("diff", "FAIL", f"{files_changed} files in one change > 8")
308
+ return CheckResult("diff", "PASS", f"+{add}/-{rem} lines, {files_changed} files")
309
+
310
+
311
+ # ── Layer 8: sandbox exec (best-effort) ─────────────────────────────────────
312
+ def check_sandbox(target: Path, kind: str) -> CheckResult:
313
+ if kind != "shell" or target.suffix != ".sh" or not target.exists():
314
+ return CheckResult("sandbox", "SKIP", "not a shell script or no target")
315
+ if not _have("docker"):
316
+ # Fall back to bash subshell with restricted env, no network
317
+ rc, out, err = _run(["env", "-i", "PATH=/usr/bin:/bin",
318
+ "bash", "-c", f"set -e; bash -n {shlex.quote(str(target))}"],
319
+ timeout=10)
320
+ return CheckResult("sandbox", "PASS" if rc == 0 else "FAIL",
321
+ (err or out).strip()[:200] or "ok-no-exec")
322
+ # docker β€” run in network=none, read-only, dropped caps
323
+ rc, out, err = _run([
324
+ "docker", "run", "--rm", "--network=none", "--read-only",
325
+ "--cap-drop=ALL", "--memory=256m", "--cpus=0.5",
326
+ "-v", f"{target}:/script.sh:ro",
327
+ "alpine:3.20", "sh", "-c", "bash /script.sh --dry-run --help 2>&1 | head -20",
328
+ ], timeout=30)
329
+ return CheckResult("sandbox", "PASS" if rc == 0 else "FAIL",
330
+ (out or err).strip()[:200] or "ok")
331
+
332
+
333
+ # ── Layer 9: confidence (with destructive-class escalation) ────────────────
334
+ def check_confidence(conf: float | None, change: str) -> CheckResult:
335
+ if conf is None:
336
+ return CheckResult("confidence", "SKIP", "no confidence supplied")
337
+ floor = CONFIDENCE_FLOOR
338
+ if _is_destructive(change):
339
+ floor = max(floor, DESTRUCTIVE_CONFIDENCE_FLOOR)
340
+ suffix = " (destructive-class)"
341
+ else:
342
+ suffix = ""
343
+ if conf < floor:
344
+ return CheckResult("confidence", "FAIL",
345
+ f"{conf:.2f} below floor {floor}{suffix}")
346
+ return CheckResult("confidence", "PASS", f"{conf:.2f} β‰₯ {floor}{suffix}")
347
+
348
+
349
+ # ── Orchestrator ────────────────────────────────────────────────────────────
350
+ def verify(change: str, target: Path, kind: str, confidence: float | None) -> Verdict:
351
+ checks = [
352
+ check_diff(change), # 7
353
+ check_policy(change), # 5 β€” fail-fast hard
354
+ check_ast(target, kind), # 1
355
+ check_lint(target, kind), # 2
356
+ check_typecheck(target, kind), # 3
357
+ check_tests(target, kind), # 4
358
+ check_security(target, change), # 6
359
+ check_sandbox(target, kind), # 8
360
+ check_confidence(confidence, change), # 9 (with destructive escalation)
361
+ ]
362
+ n_pass = sum(c.passed() for c in checks)
363
+ n_fail = sum(c.failed() for c in checks)
364
+ n_skip = sum(c.status == "SKIP" for c in checks)
365
+ reasons = [f"{c.name}: {c.detail}" for c in checks if c.failed()]
366
+ n_run = n_pass + n_fail
367
+ ok = (n_fail == 0) and (n_run >= MIN_VERIFIERS_RUN)
368
+ if not ok and n_run < MIN_VERIFIERS_RUN:
369
+ reasons.append(f"only {n_run} verifiers ran (min {MIN_VERIFIERS_RUN}) β€” install missing tools")
370
+ return Verdict(ok=ok, reasons=reasons, checks=checks,
371
+ n_pass=n_pass, n_fail=n_fail, n_skip=n_skip)
372
+
373
+
374
+ def main() -> int:
375
+ p = argparse.ArgumentParser()
376
+ p.add_argument("--change", required=True,
377
+ help="path to unified-diff or raw patch text")
378
+ p.add_argument("--target", required=True,
379
+ help="primary file the change applies to")
380
+ p.add_argument("--kind", required=True, choices=["code", "iac", "sql", "shell"])
381
+ p.add_argument("--confidence", type=float, default=None,
382
+ help="model logprob-derived confidence in [0,1]")
383
+ p.add_argument("--out", default=None, help="write verdict JSON to this path")
384
+ args = p.parse_args()
385
+
386
+ change_path = Path(args.change)
387
+ change_txt = change_path.read_text() if change_path.exists() else args.change
388
+ verdict = verify(change_txt, Path(args.target), args.kind, args.confidence)
389
+ j = json.dumps({
390
+ "ok": verdict.ok,
391
+ "reasons": verdict.reasons,
392
+ "n_pass": verdict.n_pass,
393
+ "n_fail": verdict.n_fail,
394
+ "n_skip": verdict.n_skip,
395
+ "checks": [asdict(c) for c in verdict.checks],
396
+ }, indent=2)
397
+ print(j)
398
+ if args.out:
399
+ Path(args.out).write_text(j)
400
+ return 0 if verdict.ok else 1
401
+
402
+
403
+ if __name__ == "__main__":
404
+ sys.exit(main())
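A hedged end-to-end smoke test of the gate (illustrative paths): a one-line patch containing terraform destroy should trip both the policy refuse-list and, at confidence 0.90, the 0.95 destructive-class floor, so the verdict must come back ok=false with exit 1:

```python
# Exercise the policy + confidence layers through the CLI.
import json
import pathlib
import subprocess

patch = pathlib.Path("/tmp/bad.diff")
patch.write_text("+terraform destroy -auto-approve\n")
proc = subprocess.run(
    ["python3", "bin/v2/verifier-ensemble.py",
     "--change", str(patch), "--target", "/tmp/main.tf",
     "--kind", "iac", "--confidence", "0.90"],
    capture_output=True, text=True)
verdict = json.loads(proc.stdout)   # verdict JSON printed even on failure
assert not verdict["ok"]
print(verdict["reasons"])           # expect policy + confidence failures
```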
bin/v2/watchdog.sh ADDED
@@ -0,0 +1,195 @@
1
+ #!/usr/bin/env bash
2
+ # Surrogate-1 β€” watchdog (research Β§autonomous-24x7 pattern 6).
3
+ #
4
+ # Runs as a SEPARATE process from autonomous-sre / autonomous-release. Its
5
+ # only job is to observe outcomes.jsonl + the daemon processes and KILL
6
+ # them if it detects:
7
+ # - Loop / runaway (same trigger fires N times in M minutes)
8
+ # - Audit-gap (daemon logs an "applied" outcome but verifier was missing)
9
+ # - Failure cascade (β‰₯X consecutive rollback/error outcomes)
10
+ # - Disk fill (state dir > $STATE_GB_LIMIT GB)
11
+ # - Outcome rate spike (β‰₯X outcomes in 1 min β€” likely runaway)
12
+ #
13
+ # Watchdog must NEVER apply patches itself (no Surrogate calls, no patch
14
+ # tool). It only observes and kills. Restart of daemons is a human
15
+ # decision after reading the kill reason.
16
+ #
17
+ # Usage (run on a machine separate from the daemons in the hardened setup;
18
+ # for now we run it as a sibling process):
19
+ # nohup bash bin/v2/watchdog.sh \
20
+ # > $HOME/.surrogate/logs/watchdog.log 2>&1 &
21
+ set -uo pipefail
22
+ [[ -f "$HOME/.hermes/.env" ]] && { set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a; }
23
+
24
+ STATE="$HOME/.surrogate/state"
25
+ OUTCOMES="$STATE/outcomes.jsonl"
26
+ LOG="$HOME/.surrogate/logs/watchdog.log"
27
+ KILLED="$STATE/watchdog-killed"
28
+ mkdir -p "$STATE" "$(dirname "$LOG")"
29
+
30
+ INTERVAL_SEC="${WD_INTERVAL_SEC:-60}" # check every minute
31
+ LOOP_THRESHOLD_N="${WD_LOOP_N:-5}" # same trigger β‰₯5Γ—
32
+ LOOP_WINDOW_MIN="${WD_LOOP_WIN_MIN:-15}" # in 15 min
33
+ CASCADE_THRESHOLD="${WD_CASCADE_N:-5}" # β‰₯5 consecutive failures
34
+ RATE_SPIKE_PER_MIN="${WD_RATE_SPIKE:-30}" # β‰₯30 outcomes/min
35
+ STATE_GB_LIMIT="${WD_STATE_GB:-5}"
36
+ DAEMONS=(
37
+ "autonomous-sre.sh"
38
+ "autonomous-release.sh"
39
+ "auto-swap-and-bench.sh"
40
+ )
41
+
42
+ log() { echo "[$(date '+%Y-%m-%dT%H:%M:%S')] $*" | tee -a "$LOG"; }
43
+ notify() {
44
+ [[ -z "${DISCORD_WEBHOOK:-}" ]] && return
45
+ curl -s -X POST -H "Content-Type: application/json" \
46
+ -d "{\"content\":\"🚨 watchdog: $1\"}" \
47
+ "$DISCORD_WEBHOOK" >/dev/null 2>&1 || true
48
+ }
49
+
50
+ kill_daemons() {
51
+ local reason="$1"
52
+ log "═══ KILL: $reason ═══"
53
+ notify "KILL β€” $reason"
54
+ : > "$KILLED"; date -u +%Y-%m-%dT%H:%M:%SZ >> "$KILLED"
55
+ echo "$reason" >> "$KILLED"
56
+ for d in "${DAEMONS[@]}"; do
57
+ if pgrep -f "$d" >/dev/null; then
58
+ log " pkill -f $d"
59
+ pkill -f "$d" || true
60
+ fi
61
+ done
62
+ sleep 5
63
+ for d in "${DAEMONS[@]}"; do
64
+ if pgrep -f "$d" >/dev/null; then
65
+ log " pkill -9 -f $d (still alive)"
66
+ pkill -9 -f "$d" || true
67
+ fi
68
+ done
69
+ }
70
+
71
+ # Detect: same trigger fires N times in M minutes
72
+ check_loop() {
73
+ [[ ! -f "$OUTCOMES" ]] && return 0
74
+ python3 - <<PYEOF
75
+ import json, datetime as dt, collections, sys
76
+ cutoff = dt.datetime.now(dt.timezone.utc) - dt.timedelta(minutes=$LOOP_WINDOW_MIN)
77
+ recent = collections.Counter()
78
+ for L in open("$OUTCOMES"):
79
+ try: r = json.loads(L)
80
+ except: continue
81
+ try:
82
+ ts = dt.datetime.strptime(r["ts"], "%Y-%m-%dT%H:%M:%SZ")
83
+ except: continue
84
+ if ts < cutoff: continue
85
+ recent[r.get("trigger","?")] += 1
86
+ for trig, n in recent.items():
87
+ if n >= $LOOP_THRESHOLD_N:
88
+ sys.exit(11) # loop detected
89
+ sys.exit(0)
90
+ PYEOF
91
+ return $?
92
+ }
93
+
94
+ # Detect: β‰₯X consecutive non-success outcomes
95
+ check_cascade() {
96
+ [[ ! -f "$OUTCOMES" ]] && return 0
97
+ python3 - <<PYEOF
98
+ import json, sys
99
+ streak = 0
100
+ recent = []
101
+ for L in open("$OUTCOMES"):
102
+ try: r = json.loads(L)
103
+ except Exception: continue
104
+ recent.append(r)
105
+ recent = recent[-$CASCADE_THRESHOLD:]
106
+ if len(recent) < $CASCADE_THRESHOLD:
107
+ sys.exit(0)
108
+ if all(r.get("outcome") in ("rollback","error") for r in recent):
109
+ sys.exit(12)
110
+ sys.exit(0)
111
+ PYEOF
112
+ return $?
113
+ }
114
+
115
+ # Detect: outcome rate spike (>X in last minute)
116
+ check_rate_spike() {
117
+ [[ ! -f "$OUTCOMES" ]] && return 0
118
+ python3 - <<PYEOF
119
+ import json, datetime as dt, sys
120
+ cutoff = dt.datetime.now(dt.timezone.utc) - dt.timedelta(minutes=1)
121
+ n = 0
122
+ for L in open("$OUTCOMES"):
123
+ try: r = json.loads(L)
124
+ except: continue
125
+ try:
126
+ ts = dt.datetime.strptime(r["ts"], "%Y-%m-%dT%H:%M:%SZ")
127
+ except: continue
128
+ if ts >= cutoff: n += 1
129
+ if n >= $RATE_SPIKE_PER_MIN:
130
+ sys.exit(13)
131
+ sys.exit(0)
132
+ PYEOF
133
+ return $?
134
+ }
135
+
136
+ # Detect: applied without a verdict (audit gap)
137
+ check_audit_gap() {
138
+ [[ ! -f "$OUTCOMES" ]] && return 0
139
+ python3 - <<'PYEOF'
140
+ import json, sys, os
141
+ gaps = 0
142
+ with open(os.environ["OUTCOMES"]) as f:
143
+ for L in f.readlines()[-50:]:
144
+ try: r = json.loads(L)
145
+ except: continue
146
+ if r.get("applied") and not r.get("verdict"):
147
+ gaps += 1
148
+ if gaps >= 3:
149
+ sys.exit(14)
150
+ sys.exit(0)
151
+ PYEOF
152
+ return $?
153
+ }
154
+ export OUTCOMES  # the quoted heredoc above reads this via os.environ
155
+
156
+ # Detect: state dir disk fill
157
+ check_disk() {
158
+ local kb gb
159
+ kb=$(du -sk "$STATE" 2>/dev/null | awk '{print $1}')
160
+ gb=$(( kb / 1048576 ))
161
+ if (( gb > STATE_GB_LIMIT )); then
162
+ log "state dir = ${gb}GB > limit ${STATE_GB_LIMIT}GB"
163
+ return 15
164
+ fi
165
+ return 0
166
+ }
167
+
168
+ log "═══ watchdog starting (interval=${INTERVAL_SEC}s) ═══"
169
+ notify "watchdog online"
170
+
171
+ while true; do
172
+ if [[ -f "$KILLED" ]]; then
173
+ log "killed marker present β€” staying dormant. Remove $KILLED to re-arm."
174
+ sleep "$INTERVAL_SEC"
175
+ continue
176
+ fi
177
+
178
+ rc=0
179
+ check_loop || rc=$?
180
+ [[ $rc -eq 11 ]] && { kill_daemons "LOOP detected (β‰₯$LOOP_THRESHOLD_N same trigger in ${LOOP_WINDOW_MIN}m)"; continue; }
181
+
182
+ check_cascade || rc=$?
183
+ [[ $rc -eq 12 ]] && { kill_daemons "CASCADE: $CASCADE_THRESHOLD consecutive rollback/error"; continue; }
184
+
185
+ check_rate_spike || rc=$?
186
+ [[ $rc -eq 13 ]] && { kill_daemons "RATE SPIKE: β‰₯$RATE_SPIKE_PER_MIN outcomes in 60s"; continue; }
187
+
188
+ check_audit_gap || rc=$?
189
+ [[ $rc -eq 14 ]] && { kill_daemons "AUDIT GAP: β‰₯3 applied actions without verdict"; continue; }
190
+
191
+ check_disk || rc=$?
192
+ [[ $rc -eq 15 ]] && { kill_daemons "DISK: $STATE >${STATE_GB_LIMIT}GB"; continue; }
193
+
194
+ sleep "$INTERVAL_SEC"
195
+ done