Ashira Pitchayapakayakul committed 7 days ago

Commit

cc2fe17

1 Parent(s): 326e0f0

pivot(v9): SRE-specialist trainer prep — knowledge corpora + 6 role personas

Honest course-correction: V8 was harness-heavy + training-light. User
asked for the MODEL itself to gain senior-SRE capability, not for a shell
wrapper that calls a black-box LLM. V9 fixes that.

This commit ships the data-prep tooling that V9 trainer will consume:

build-knowledge-corpus.sh — distills 15 public docs sources (AWS, K8s,
Terraform, Prometheus/Grafana/Loki/Tempo, SRE workbook, postmortem
corpus, CVE/EPSS, MITRE ATT&CK, CIS, NIST 800-53, SLSA, Cilium, FinOps,
SOC2/PCI/HIPAA/GDPR, Prowler/ScoutSuite/Wiz playbooks) into ~75K Q&A
pairs via frontier model (Cerebras → Groq → Anthropic fallback chain).
Each corpus pushed to its own HF dataset axentx/surrogate-1-knowledge-*
so V9 trainer can stream them in via merge_external().
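
The Cerebras → Groq → Anthropic fallback chain mentioned above amounts to trying each provider in order and keeping the first usable reply. A minimal sketch of that control flow, under stated assumptions: `Provider`, `first_success`, and the two stub providers are hypothetical names for illustration, not code from this commit, and a real provider would wrap an HTTP call.

```python
from typing import Callable, Optional

# Hypothetical type: a provider takes a prompt and returns text, or raises on failure.
Provider = Callable[[str], str]


def first_success(prompt: str,
                  providers: list[tuple[str, Provider]]) -> Optional[tuple[str, str]]:
    """Return (provider_name, reply) from the first provider that answers."""
    for name, call in providers:
        try:
            reply = call(prompt)
        except Exception as exc:  # network error, rate limit, bad payload, ...
            print(f"{name} failed: {exc}")
            continue
        if reply:  # an empty reply counts as a failure too
            return name, reply
    return None


# Stub providers standing in for real API clients (assumptions for the demo).
def rate_limited(prompt: str) -> str:
    raise RuntimeError("429 rate limited")


def echo(prompt: str) -> str:
    return f"answer to: {prompt}"


result = first_success("How do I drain a node?",
                       [("cerebras", rate_limited), ("groq", echo)])
print(result)
```

The point of the design is that a failure in one provider is logged and skipped rather than treated as success, so later providers in the list actually get a chance to run.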

generate-role-personas.py — for each of 6 arkship roles (Guardian,
Navigator, Assembler, Sherlock, Auditor, Coach), generates 1K
role-specific training pairs. Each pair includes the role's system
prompt + a realistic scenario + an expert response in the role's
output format (Sherlock = 5-Whys+timeline, Navigator = spec/plan/
checklist, Assembler = passes-cfn-guard/tfsec, etc.). Total ~6K
role-specific pairs across roles, plus a merged dataset for trainer
convenience.
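
Because each stored pair carries the role's system prompt alongside the scenario and the response, a trainer could plausibly map a row onto chat messages as sketched here. This is an assumption about downstream usage, not something the commit defines: `to_messages` is a hypothetical helper, and the sample row is made up for illustration. Note that the row's `role` field names the persona (for example `sherlock`), which is distinct from the chat role of each message.

```python
def to_messages(row: dict) -> list[dict]:
    """Map one persona row to system/user/assistant chat messages."""
    return [
        {"role": "system", "content": row["system"]},
        {"role": "user", "content": row["prompt"]},
        {"role": "assistant", "content": row["response"]},
    ]


# Illustrative row in the shape written by generate-role-personas.py.
row = {
    "role": "sherlock",
    "system": "You are Sherlock, a senior SRE doing root-cause analysis.",
    "prompt": "Checkout p99 latency doubled after the last deploy. Where do I start?",
    "response": "5-Whys: ... Timeline: ... Blast radius: ... Remediation: ...",
}

messages = to_messages(row)
for m in messages:
    print(m["role"], "->", m["content"][:40])
```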

Spec: ~/Documents/Obsidian Vault/AI-Hub/knowledge/surrogate-1-v9-spec.md
defines the full V9 plan: 250-300K training pairs (3× V8) + GRPO
default-on + DPO Phase 3 + Constitutional AI + Reflexion + TruthRL +
DyT model surgery + 32K trained context + per-role evals (axentx-eval-
300) + CloudOpsBench/O11yBench/AIOps-Lab. Targets Civo L40S 48GB
($50-180 of the $250 reserved budget).

V8 keeps running on Kaggle as baseline — V9 will surpass it
dramatically because (a) 3× more training signal, (b) actual SRE
knowledge corpora baked in, (c) 6 role personas the model can wear,
(d) training techniques V8 only scaffolded.

What stays from V8: the agentic harness (verifier-ensemble, autonomous-
sre/release, watchdog, self-improve flywheel). Those are framework-
agnostic — they're correct, they just need a model that's actually
been trained for the work.

Files changed (2)

bin/v2/build-knowledge-corpus.sh +258 -0
bin/v2/generate-role-personas.py +349 -0

bin/v2/build-knowledge-corpus.sh ADDED

	@@ -0,0 +1,258 @@

+#!/usr/bin/env bash
+# Surrogate-1 V9 — knowledge corpus distillation pipeline.
+#
+# Pulls 15 public sources, distills each into Q&A instruction pairs via a
+# frontier model (Cerebras Llama-3.3-70B, falling back to Groq), drops exact
+# duplicates by hashing prompt/response prefixes (MD5, not MinHash), and
+# pushes to HF datasets `axentx/surrogate-1-knowledge-{...}`.
+#
+# Output: up to ~64K Q&A pairs across 15 corpora (the sum of the per-corpus
+# targets in CORPORA below, before dedup). These are then mixed into the V9
+# trainer via merge_external() in v9-trainer.sh.
+#
+# Why distill instead of train on raw docs:
+#   1. 7B/14B/32B can't absorb raw 500K-page docs efficiently
+#   2. Q&A format aligns with downstream chat-template usage
+#   3. Frontier model picks the *teachable* angle on each fact
+#   4. Filterable by quality — bad Q&A is regenerated
+#
+# Cost (per corpus):
+#   - ~5K pairs × ~2K tokens/pair = ~10M tokens
+#   - Cerebras free tier: ~14M tokens/day → 1 corpus/day
+#   - Groq free tier: ~30M tokens/day → 3 corpora/day in parallel
+#   - All 15 corpora: ~5-7 days of free-tier consumption, OR 1-2 days
+#     w/ paid burst (~$30-50)
+#
+# Usage:
+#   bash bin/v2/build-knowledge-corpus.sh           # all 15
+#   bash bin/v2/build-knowledge-corpus.sh aws       # one
+#   bash bin/v2/build-knowledge-corpus.sh --dry-run # plan only
+set -uo pipefail
+[[ -f "$HOME/.hermes/.env" ]] && { set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a; }
+WHICH="${1:-all}"
+DRY="${DRY_RUN:-0}"
+[[ "$WHICH" == "--dry-run" ]] && { DRY=1; WHICH="all"; }
+WORK="$HOME/.surrogate/state/knowledge-corpus"
+LOG="$HOME/.surrogate/logs/build-knowledge-corpus.log"
+mkdir -p "$WORK" "$(dirname "$LOG")"
+log() { echo "[$(date '+%Y-%m-%dT%H:%M:%S')] $*" | tee -a "$LOG"; }
+notify() {
+    [[ -z "${DISCORD_WEBHOOK:-}" ]] && return
+    curl -s -X POST -H "Content-Type: application/json" \
+        -d "$(python3 -c "import json,sys; print(json.dumps({'content': sys.argv[1]}))" "$1")" \
+        "$DISCORD_WEBHOOK" >/dev/null 2>&1 || true
+}
+# ── 15 corpora definitions ──────────────────────────────────────────────────
+# Each row: name | source-url-or-hf-dataset | n-pairs-target | hf-dest-dataset
+declare -a CORPORA=(
+    "aws|https://docs.aws.amazon.com|8000|axentx/surrogate-1-knowledge-aws"
+    "k8s|https://kubernetes.io/docs|5000|axentx/surrogate-1-knowledge-k8s"
+    "iac|https://developer.hashicorp.com/terraform/docs|5000|axentx/surrogate-1-knowledge-iac"
+    "o11y|prometheus.io+grafana.com+grafana.com/docs/loki+grafana.com/docs/tempo|5000|axentx/surrogate-1-knowledge-o11y"
+    "sre-workbook|https://sre.google/sre-book|4000|axentx/surrogate-1-knowledge-sre-patterns"
+    "postmortem|aggregated-public|3000|axentx/surrogate-1-knowledge-postmortem"
+    "cve|https://nvd.nist.gov/feeds|5000|axentx/surrogate-1-knowledge-cve-epss"
+    "mitre|https://attack.mitre.org/api|4000|axentx/surrogate-1-knowledge-mitre"
+    "cis|https://www.cisecurity.org/benchmark|5000|axentx/surrogate-1-knowledge-cis"
+    "nist|https://csrc.nist.gov/publications/sp800|4000|axentx/surrogate-1-knowledge-nist"
+    "slsa|https://slsa.dev|3000|axentx/surrogate-1-knowledge-slsa"
+    "cilium|https://docs.cilium.io|3000|axentx/surrogate-1-knowledge-cilium"
+    "finops|https://www.finops.org|3000|axentx/surrogate-1-knowledge-finops"
+    "compliance|soc2+pci+hipaa+gdpr|3000|axentx/surrogate-1-knowledge-compliance"
+    "cloudsec|prowler+scoutsuite+wiz|4000|axentx/surrogate-1-knowledge-cloudsec"
+)
+# ── frontier model dispatcher (Cerebras → Groq) ─────────────────────────────
+# Keep only well-formed {"prompt","response"} JSONL lines from a chat reply on stdin.
+parse_pairs() {
+    python3 -c '
+import json, sys
+try:
+    txt = json.load(sys.stdin)["choices"][0]["message"]["content"]
+except Exception as e:
+    sys.exit(f"{sys.argv[1]} parse fail: {e}")
+for line in txt.splitlines():
+    line = line.strip()
+    if not line or line.startswith("```"):
+        continue
+    try:
+        j = json.loads(line)
+    except ValueError:
+        continue
+    if isinstance(j, dict) and "prompt" in j and "response" in j:
+        print(json.dumps(j, ensure_ascii=False))
+' "$1"
+}
+# POST one prompt to an OpenAI-compatible chat endpoint and append parsed pairs.
+# Returns non-zero if the request or the reply parsing fails.
+call_provider() {
+    local name="$1" url="$2" key="$3" model="$4" prompt="$5" out_jsonl="$6" body
+    # Pass the prompt through the environment so chunk text (quotes, backslashes)
+    # is never spliced into Python source.
+    body=$(PROMPT="$prompt" MODEL="$model" python3 -c '
+import json, os
+print(json.dumps({
+    "model": os.environ["MODEL"],
+    "messages": [{"role": "user", "content": os.environ["PROMPT"]}],
+    "max_tokens": 4000, "temperature": 0.4,
+}))') || return 1
+    curl -fsS --max-time 60 \
+        -H "Authorization: Bearer $key" \
+        -H "Content-Type: application/json" \
+        -d "$body" "$url" | parse_pairs "$name" >> "$out_jsonl"
+}
+distill_via_frontier() {
+    local source_text="$1" n_pairs="$2" out_jsonl="$3"
+    local prompt="
+You are distilling cloud/SRE engineering knowledge into instruction-response pairs
+suitable for fine-tuning a 7B-32B code LLM into a senior SRE/DevSecOps engineer.
+Source material (~one page chunk):
+\`\`\`
+$source_text
+\`\`\`
+Generate UP TO $n_pairs high-quality Q&A pairs from this material. Each pair:
+- Question = realistic engineer-asks-engineer question (not 'what is X?')
+- Answer = expert-level response, cite real APIs/CLI/syntax/standards
+- Diverse difficulty: some operational, some architectural, some incident-shaped
+- No filler words, no hedging, no 'as an AI'
+Output STRICT JSONL, one pair per line:
+{\"prompt\":\"...\",\"response\":\"...\"}
+"
+    # Try each provider in order; fall through to the next one on failure.
+    if [[ -n "${CEREBRAS_API_KEY:-}" ]] && call_provider cerebras \
+            "https://api.cerebras.ai/v1/chat/completions" "$CEREBRAS_API_KEY" \
+            "llama-3.3-70b" "$prompt" "$out_jsonl"; then
+        return 0
+    fi
+    if [[ -n "${GROQ_API_KEY:-}" ]] && call_provider groq \
+            "https://api.groq.com/openai/v1/chat/completions" "$GROQ_API_KEY" \
+            "llama-3.3-70b-versatile" "$prompt" "$out_jsonl"; then
+        return 0
+    fi
+    log "  ⚠ no frontier provider succeeded (need a working CEREBRAS_API_KEY or GROQ_API_KEY)"
+    return 1
+}
+# ── per-corpus pipeline ─────────────────────────────────────────────────────
+build_one_corpus() {
+    local entry="$1"
+    IFS='|' read -r name source target_n hf_dest <<< "$entry"
+    log ""
+    log "═══ corpus: $name (target=${target_n} pairs → $hf_dest) ═══"
+    local out_jsonl="$WORK/$name.jsonl"
+    if (( DRY )); then
+        log "  [DRY-RUN] would distill from $source → $target_n pairs → $hf_dest"
+        return 0
+    fi
+    [[ -f "$out_jsonl" ]] && {
+        local existing
+        existing=$(wc -l < "$out_jsonl" | tr -d ' ')
+        if (( existing >= target_n )); then
+            log "  ✓ already at $existing pairs (target $target_n) — pushing"
+            push_to_hf "$out_jsonl" "$hf_dest"
+            return 0
+        fi
+        log "  resuming from $existing pairs"
+    }
+    # Pull source chunks (per-corpus implementations live in build-corpus-helpers/)
+    local helper="$HOME/.surrogate/hf-space/bin/v2/build-corpus-helpers/$name.sh"
+    if [[ ! -x "$helper" ]]; then
+        log "  ⚠ helper $helper missing — using generic web-fetch path"
+        helper="$HOME/.surrogate/hf-space/bin/v2/build-corpus-helpers/_generic.sh"
+    fi
+    bash "$helper" "$source" "$WORK/$name-chunks.txt" 2>>"$LOG" || {
+        log "  ✗ chunk fetch failed for $name"
+        return 1
+    }
+    local n_chunks; n_chunks=$(wc -l < "$WORK/$name-chunks.txt" 2>/dev/null || echo 0)
+    log "  fetched $n_chunks chunks for distillation"
+    touch "$out_jsonl"  # ensure the file exists so the wc and dedup steps below cannot fail
+    local n_pairs=0
+    while IFS= read -r chunk; do
+        [[ -z "$chunk" ]] && continue
+        distill_via_frontier "$chunk" 8 "$out_jsonl" 2>>"$LOG" || true
+        n_pairs=$(wc -l < "$out_jsonl" 2>/dev/null | tr -d ' ')
+        if (( n_pairs % 200 < 8 )); then
+            log "  progress: $n_pairs / $target_n pairs"
+        fi
+        (( n_pairs >= target_n )) && break
+    done < "$WORK/$name-chunks.txt"
+    log "  ✓ distilled $n_pairs pairs"
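+    # Exact dedup keyed on an MD5 of the first 200 chars of prompt + response.
+    # This is not MinHash: near-duplicates that differ in those prefixes are kept.
+    log "  → prefix-hash dedup..."
+    python3 - "$out_jsonl" <<'PYEOF'
+import json, sys
+from hashlib import md5
+path = sys.argv[1]
+seen = set()
+out = []
+with open(path, encoding="utf-8") as f:
+    for line in f:
+        try:
+            j = json.loads(line)
+            key = j["prompt"][:200] + j["response"][:200]
+        except (ValueError, KeyError, TypeError):
+            continue  # skip malformed rows
+        h = md5(key.encode()).hexdigest()
+        if h in seen:
+            continue
+        seen.add(h)
+        out.append(j)
+with open(path, "w", encoding="utf-8") as f:
+    for j in out:
+        f.write(json.dumps(j, ensure_ascii=False) + "\n")
+print(f"  dedup: {len(out)} unique pairs")
+PYEOF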
+    push_to_hf "$out_jsonl" "$hf_dest"
+}
+push_to_hf() {
+    local jsonl="$1" repo="$2"
+    [[ -z "${HF_TOKEN:-}" ]] && { log "  HF_TOKEN missing"; return 1; }
+    python3 - <<PYEOF
+import os
+from huggingface_hub import HfApi, create_repo
+api = HfApi(token=os.environ["HF_TOKEN"])
+try: create_repo("$repo", repo_type="dataset", exist_ok=True, private=False)
+except Exception as e: print(f"  create_repo: {e}")
+api.upload_file(path_or_fileobj="$jsonl", path_in_repo="train.jsonl",
+                repo_id="$repo", repo_type="dataset",
+                commit_message="surrogate-1 V9 knowledge corpus")
+print(f"  ✓ pushed → https://huggingface.co/datasets/$repo")
+PYEOF
+}
+# ── dispatch ────────────────────────────────────────────────────────────────
+log "═══ build-knowledge-corpus starting (which=$WHICH, dry=$DRY) ═══"
+if [[ "$WHICH" == "all" ]]; then
+    for entry in "${CORPORA[@]}"; do
+        build_one_corpus "$entry"
+    done
+else
+    for entry in "${CORPORA[@]}"; do
+        IFS='|' read -r n _ _ _ <<< "$entry"
+        if [[ "$n" == "$WHICH" ]]; then
+            build_one_corpus "$entry"
+            break
+        fi
+    done
+fi
+log ""
+log "═══ done ═══"
+notify "knowledge-corpus build done — $(ls "$WORK"/*.jsonl 2>/dev/null | wc -l | tr -d ' ') corpora ready"
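
The dedup step in the script keys on an MD5 of the first 200 characters of each prompt and response, so it only catches rows whose prefixes match exactly. For comparison, here is a minimal pure-Python sketch of what MinHash-style near-duplicate filtering could look like. Everything in it is an illustrative assumption rather than code from this commit: the function names, the 3-word shingles, the 64 hash permutations, and the 0.8 threshold.

```python
import hashlib
import re


def shingles(text: str, k: int = 3) -> set[str]:
    """Lowercased word k-grams; short texts fall back to a single shingle."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash(text: str, num_perm: int = 64) -> list[int]:
    """One slot per salted hash function: the minimum hash over all shingles."""
    sh = shingles(text)
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8, salt=salt).digest(), "big")
            for s in sh
        ))
    return sig


def similarity(a: list[int], b: list[int]) -> float:
    """Fraction of matching slots, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def dedup(pairs: list[dict], threshold: float = 0.8) -> list[dict]:
    """Greedy O(n^2) filter: fine for a sketch, too slow for tens of thousands of rows."""
    kept, sigs = [], []
    for p in pairs:
        sig = minhash(p["prompt"] + " " + p["response"])
        if any(similarity(sig, s) >= threshold for s in sigs):
            continue
        kept.append(p)
        sigs.append(sig)
    return kept


pairs = [
    {"prompt": "How do I drain a Kubernetes node safely?",
     "response": "Run kubectl drain with --ignore-daemonsets."},
    {"prompt": "how do i drain a kubernetes node safely",
     "response": "run kubectl drain with --ignore-daemonsets"},
    {"prompt": "Why is my HPA flapping?",
     "response": "Check the stabilization window and metric lag."},
]
unique = dedup(pairs)
print(len(unique))
```

The second row differs from the first only in case and punctuation, so an MD5 over raw prefixes would keep both, while the shingle-based signature treats them as the same text. At corpus scale a banded LSH index would normally replace the pairwise comparison.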

bin/v2/generate-role-personas.py ADDED

	@@ -0,0 +1,349 @@

+#!/usr/bin/env python3
+"""Surrogate-1 V9 — 6 role-persona training data generator.
+For each of the 6 arkship roles (Guardian, Navigator, Assembler, Sherlock,
+Auditor, Coach), generate ~1000 high-quality training pairs:
+  • Each pair includes the role's system prompt
+  • Question = realistic scenario in the role's domain
+  • Response = expert-level, citing real APIs/standards/runbooks
+  • Diversity across difficulty + scenario type
+Output: 6 HF datasets `axentx/surrogate-1-roles-{role}` + a unified
+`axentx/surrogate-1-roles-merged` for trainer convenience.
+Pipeline:
+  1. For each role, prepare 100+ scenario seeds (from public corpora +
+     arkship/decisions/ + curated)
+  2. For each seed, prompt frontier model to generate 8-10 variations
+  3. Drop exact duplicates via an MD5 of prompt/response prefixes (not MinHash)
+  4. Push each role to its own HF dataset
+"""
+from __future__ import annotations
+import argparse
+import json
+import os
+import sys
+import time
+from hashlib import md5
+from pathlib import Path
+from urllib import request
+ROLES = {
+    "guardian": {
+        "system": (
+            "You are Guardian, a senior security engineer focused on threat "
+            "detection, vulnerability management, and incident containment. "
+            "You cite real CVEs (CVE-YYYY-NNNNN), MITRE ATT&CK techniques "
+            "(T####), CIS benchmarks, NIST 800-53 controls, and OWASP. "
+            "You never confabulate APIs or CVE numbers. You output runnable "
+            "remediations (kubectl/aws-cli/sql) when patches are needed."
+        ),
+        "scenario_types": [
+            "patch a high-severity CVE in our infrastructure",
+            "respond to a Prowler finding (label=ALARM)",
+            "map a suspicious behavior to ATT&CK techniques",
+            "harden a misconfigured AWS resource per CIS",
+            "plan an emergency response to a credential leak",
+            "assess blast radius of a vulnerability in a shared base image",
+            "remediate a SOC2 audit finding",
+            "design a secrets rotation given Vault was compromised",
+        ],
+    },
+    "navigator": {
+        "system": (
+            "You are Navigator, a senior architect designing multi-step "
+            "deployments. You produce three artifacts in order: spec.md "
+            "(what + why), plan.md (how + sequence + rollback), and "
+            "checklist.md (verification steps). You cite real services, "
+            "estimate costs, and call out trade-offs explicitly. You never "
+            "skip the spec."
+        ),
+        "scenario_types": [
+            "design a multi-region disaster recovery for a stateful service",
+            "plan a database migration from PostgreSQL 12 → 16 with zero downtime",
+            "architect an event-driven autoscaler for sporadic traffic",
+            "design canary deployment with metric-gated promotion",
+            "plan a Kubernetes upgrade across 3 prod clusters",
+            "design a multi-tenant data isolation strategy",
+            "architect cost-optimization migration from on-demand to spot/reserved",
+            "plan SBOM generation + signing into existing CI/CD",
+        ],
+    },
+    "assembler": {
+        "system": (
+            "You are Assembler, a senior platform engineer. You turn plans "
+            "into IaC (Terraform / CloudFormation / CDK / Pulumi) + CI/CD "
+            "pipelines. Every output passes cfn-guard / tfsec / checkov / "
+            "trivy / hadolint without warnings. You write idempotent + "
+            "reversible changes with explicit rollback paths."
+        ),
+        "scenario_types": [
+            "write a Terraform module for a multi-AZ RDS with secret rotation",
+            "convert a CloudFormation template to CDK Python",
+            "build a GitHub Actions workflow with OIDC + cosign + SBOM",
+            "write Pulumi for a Lambda + API Gateway + DynamoDB stack",
+            "implement Helm chart with ServiceAccount + NetworkPolicy",
+            "write a Kustomize overlay for staging vs prod",
+            "build a Crossplane composition for an opinionated app pattern",
+            "write Argo Rollouts canary with AnalysisTemplate metric gates",
+        ],
+    },
+    "sherlock": {
+        "system": (
+            "You are Sherlock, a senior SRE doing root-cause analysis. You "
+            "read logs, metrics, and traces. You produce 5-Whys + timeline "
+            "+ blast radius + remediation, in that order. You cite specific "
+            "log lines, PromQL queries, and trace IDs. You never blame people."
+        ),
+        "scenario_types": [
+            "investigate an elevated p99 latency on the checkout service",
+            "diagnose intermittent 500s correlated with deployment N+1",
+            "find why a Kafka consumer lag spiked at 03:00",
+            "trace a memory leak through OTel spans",
+            "identify why TLS cert renewal failed silently",
+            "investigate cross-AZ network blip causing replica desync",
+            "RCA on a runaway Lambda costing $4K/hr",
+            "diagnose why HPA is flapping on a pod",
+        ],
+    },
+    "auditor": {
+        "system": (
+            "You are Auditor, a compliance engineer. You map technical "
+            "changes to SOC2 / PCI-DSS / HIPAA / NIST 800-53 / ISO 27001 "
+            "controls. You produce evidence trails (log queries + screenshots "
+            "+ ticket IDs) and call out control gaps. You never claim "
+            "compliance without evidence."
+        ),
+        "scenario_types": [
+            "map a new IAM role design to SOC2 CC6.1-CC6.3",
+            "produce evidence trail for a quarterly access review",
+            "identify PCI-DSS gaps in a new payments microservice",
+            "document HIPAA controls for a healthcare data pipeline",
+            "create a Risk Register entry for a third-party SDK",
+            "map encryption-at-rest config to NIST 800-53 SC-28",
+            "respond to an SOC2 Type II audit data request",
+            "produce a compliance impact assessment for a CDN swap",
+        ],
+    },
+    "coach": {
+        "system": (
+            "You are Coach, a senior engineer mentoring juniors. You explain "
+            "at the right level of abstraction (start concrete, generalize), "
+            "suggest best practices, and link to authoritative docs. You "
+            "ask probing questions to teach reasoning, not just answers."
+        ),
+        "scenario_types": [
+            "explain why we use blue-green vs canary deployments",
+            "teach a junior how to read a flame graph",
+            "explain backpressure to someone new to streaming",
+            "walk through writing a useful runbook from a postmortem",
+            "explain when to use Lambda vs Fargate vs EC2",
+            "teach how to estimate cost of a new architecture",
+            "explain why we prefer least-privilege IAM",
+            "teach when fan-out via SNS beats polling",
+        ],
+    },
+}
+def hash_pair(prompt: str, response: str) -> str:
+    return md5((prompt[:200] + response[:200]).encode()).hexdigest()
+def call_frontier(prompt: str, max_tokens: int = 4000,
+                  temp: float = 0.5) -> str | None:
+    """Try Cerebras → Groq → Anthropic in priority order."""
+    cerebras_key = os.environ.get("CEREBRAS_API_KEY", "")
+    groq_key     = os.environ.get("GROQ_API_KEY", "")
+    anthropic_k  = os.environ.get("ANTHROPIC_API_KEY", "")
+    if cerebras_key:
+        try:
+            req = request.Request(
+                "https://api.cerebras.ai/v1/chat/completions",
+                data=json.dumps({
+                    "model": "llama-3.3-70b",
+                    "messages": [{"role": "user", "content": prompt}],
+                    "max_tokens": max_tokens, "temperature": temp,
+                }).encode(),
+                headers={"Authorization": f"Bearer {cerebras_key}",
+                         "Content-Type": "application/json"})
+            with request.urlopen(req, timeout=60) as r:
+                d = json.loads(r.read().decode())
+            return d["choices"][0]["message"]["content"]
+        except Exception as e:
+            sys.stderr.write(f"cerebras: {e}\n")
+    if groq_key:
+        try:
+            req = request.Request(
+                "https://api.groq.com/openai/v1/chat/completions",
+                data=json.dumps({
+                    "model": "llama-3.3-70b-versatile",
+                    "messages": [{"role": "user", "content": prompt}],
+                    "max_tokens": max_tokens, "temperature": temp,
+                }).encode(),
+                headers={"Authorization": f"Bearer {groq_key}",
+                         "Content-Type": "application/json"})
+            with request.urlopen(req, timeout=60) as r:
+                d = json.loads(r.read().decode())
+            return d["choices"][0]["message"]["content"]
+        except Exception as e:
+            sys.stderr.write(f"groq: {e}\n")
+    if anthropic_k:
+        try:
+            req = request.Request(
+                "https://api.anthropic.com/v1/messages",
+                data=json.dumps({
+                    "model": "claude-haiku-4-5",
+                    "max_tokens": max_tokens,
+                    "messages": [{"role": "user", "content": prompt}],
+                }).encode(),
+                headers={"x-api-key": anthropic_k,
+                         "anthropic-version": "2023-06-01",
+                         "Content-Type": "application/json"})
+            with request.urlopen(req, timeout=60) as r:
+                d = json.loads(r.read().decode())
+            return d["content"][0]["text"]
+        except Exception as e:
+            sys.stderr.write(f"anthropic: {e}\n")
+    return None
+def generate_role(role: str, target_n: int, work_dir: Path) -> int:
+    cfg = ROLES[role]
+    out_path = work_dir / f"{role}.jsonl"
+    seen = set()
+    if out_path.exists():
+        with out_path.open() as f:
+            for L in f:
+                try:
+                    j = json.loads(L)
+                    seen.add(hash_pair(j["prompt"], j["response"]))
+                except: pass
+    f_out = out_path.open("a")
+    n_existing = len(seen)
+    print(f"  {role}: {n_existing} existing, target {target_n}")
+    seed_idx = 0
+    empty_rounds = 0  # consecutive frontier calls that added no new pairs
+    while len(seen) < target_n:
+        scenario = cfg["scenario_types"][seed_idx % len(cfg["scenario_types"])]
+        seed_idx += 1
+        prompt = f"""You are generating training data for fine-tuning a 7B-32B
+code LLM into the role of "{role.upper()}", a senior cloud/SRE engineer.
+Role system prompt (the model will see this at inference):
+\"\"\"
+{cfg['system']}
+\"\"\"
+Generate 8-10 high-quality training pairs based on this scenario type:
+"{scenario}"
+Each pair:
+- prompt = realistic engineer-asks-engineer question (concrete, specific, with
+           plausible context; NOT 'what is X?')
+- response = expert response IN-ROLE — cite real APIs/CLIs/standards, give
+            runnable code/commands/queries when applicable, use the role's
+            output format (e.g., Sherlock = 5-Whys+timeline; Navigator =
+            spec.md/plan.md/checklist.md). Length 200-600 words.
+Output STRICT JSONL (one pair per line). Each line is valid JSON:
+{{"prompt": "<question>", "response": "<expert response>"}}
+"""
+        out = call_frontier(prompt, max_tokens=4000, temp=0.5)
+        if not out:
+            print(f"  ✗ frontier call failed (no API key working); stopping {role}")
+            break
+        added = 0
+        for L in out.splitlines():
+            L = L.strip()
+            if not L or L.startswith("```"):
+                continue
+            try:
+                j = json.loads(L)
+            except Exception:
+                continue
+            # Guard against valid JSON that is not an object (e.g. a bare number).
+            if not isinstance(j, dict) or "prompt" not in j or "response" not in j:
+                continue
+            h = hash_pair(j["prompt"], j["response"])
+            if h in seen:
+                continue
+            seen.add(h)
+            j["role"] = role
+            j["system"] = cfg["system"]
+            f_out.write(json.dumps(j, ensure_ascii=False) + "\n")
+            added += 1
+        f_out.flush()
+        if added == 0:
+            # Without a cap, replies that never parse would loop (and bill) forever.
+            empty_rounds += 1
+            if empty_rounds >= 5:
+                print(f"  ✗ {role}: 5 calls in a row added no new pairs; stopping")
+                break
+            time.sleep(2)
+            continue
+        empty_rounds = 0
+        if len(seen) % 100 < 10:
+            print(f"  {role}: {len(seen)} / {target_n}")
+    f_out.close()
+    print(f"  ✓ {role}: {len(seen)} pairs (was {n_existing})")
+    return len(seen)
+def push_to_hf(jsonl: Path, repo: str) -> None:
+    token = os.environ.get("HF_TOKEN", "")
+    if not token:
+        print(f"  no HF_TOKEN — skipping push of {repo}")
+        return
+    try:
+        from huggingface_hub import HfApi, create_repo
+        api = HfApi(token=token)
+        try:
+            create_repo(repo, repo_type="dataset", exist_ok=True, private=False)
+        except Exception as e:
+            print(f"  create_repo: {e}")
+        api.upload_file(path_or_fileobj=str(jsonl),
+                        path_in_repo="train.jsonl",
+                        repo_id=repo, repo_type="dataset",
+                        commit_message="surrogate-1 V9 role persona")
+        print(f"  ✓ pushed → https://huggingface.co/datasets/{repo}")
+    except Exception as e:
+        print(f"  push failed: {e}")
+def main() -> int:
+    p = argparse.ArgumentParser()
+    p.add_argument("--role", default="all",
+                   choices=["all", *ROLES.keys()])
+    p.add_argument("--target", type=int, default=1000,
+                   help="pairs per role (default 1000)")
+    p.add_argument("--no-push", action="store_true")
+    args = p.parse_args()
+    work = Path.home() / ".surrogate/state/role-personas"
+    work.mkdir(parents=True, exist_ok=True)
+    roles = list(ROLES.keys()) if args.role == "all" else [args.role]
+    for r in roles:
+        n = generate_role(r, args.target, work)
+        if n > 0 and not args.no_push:
+            push_to_hf(work / f"{r}.jsonl",
+                       f"axentx/surrogate-1-roles-{r}")
+    # Build merged dataset for trainer convenience
+    if args.role == "all" and not args.no_push:
+        merged = work / "merged.jsonl"
+        with merged.open("w") as out:
+            for r in ROLES:
+                part = work / f"{r}.jsonl"
+                if part.exists():
+                    out.write(part.read_text())
+        push_to_hf(merged, "axentx/surrogate-1-roles-merged")
+    return 0
+if __name__ == "__main__":
+    sys.exit(main())
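
Since the merged file simply concatenates the per-role JSONL files, a quick per-role count is one way to confirm that every persona made it in before training. The sketch below is an illustration only: `count_by_role` is a hypothetical helper, and the demo writes a tiny stand-in file to a temporary directory instead of reading the real `merged.jsonl`.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def count_by_role(path: Path) -> Counter:
    """Count rows per `role` field in a JSONL file, skipping blank lines."""
    counts: Counter = Counter()
    with path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            counts[json.loads(line).get("role", "unknown")] += 1
    return counts


# Stand-in rows in the shape the generator writes (made up for the demo).
rows = [
    {"role": "guardian", "system": "...", "prompt": "q1", "response": "a1"},
    {"role": "guardian", "system": "...", "prompt": "q2", "response": "a2"},
    {"role": "coach", "system": "...", "prompt": "q3", "response": "a3"},
]

with tempfile.TemporaryDirectory() as tmp:
    merged = Path(tmp) / "merged.jsonl"
    merged.write_text("".join(json.dumps(r) + "\n" for r in rows), encoding="utf-8")
    counts = count_by_role(merged)

print(dict(counts))
```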