Ashira Pitchayapakayakul committed on
Commit
cc2fe17
·
1 Parent(s): 326e0f0

pivot(v9): SRE-specialist trainer prep — knowledge corpora + 6 role personas


Honest course-correction: V8 was harness-heavy + training-light. User
asked for the MODEL itself to gain senior-SRE capability, not for a shell
wrapper that calls a black-box LLM. V9 fixes that.

This commit ships the data-prep tooling that V9 trainer will consume:

build-knowledge-corpus.sh: distills 15 public docs sources (AWS, K8s,
Terraform, Prometheus/Grafana/Loki/Tempo, SRE workbook, postmortem
corpus, CVE/EPSS, MITRE ATT&CK, CIS, NIST 800-53, SLSA, Cilium, FinOps,
SOC2/PCI/HIPAA/GDPR, Prowler/ScoutSuite/Wiz playbooks) into ~64K Q&A
pairs (the sum of the script's per-corpus targets) via a frontier model
(Cerebras, else Groq; the Anthropic fallback is only wired into
generate-role-personas.py). Each corpus is pushed to its own HF dataset,
axentx/surrogate-1-knowledge-*, so the V9 trainer can stream them in via
merge_external().
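
The provider fallback and the strict-JSONL filtering described above can be sketched in a few lines. This is a minimal illustration, not the code in this commit: `first_success`, `parse_pairs`, and the `Provider` type are hypothetical names, and the providers are plain callables so the sketch runs without any API key.

```python
import json
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical type: a provider takes a prompt and returns text (or None).
Provider = Callable[[str], Optional[str]]

FENCE = "`" * 3  # markdown code-fence marker that models often wrap output in


def first_success(providers: Sequence[Tuple[str, Provider]],
                  prompt: str) -> Optional[str]:
    """Try each provider in order; return the first non-empty completion."""
    for name, call in providers:
        try:
            text = call(prompt)
        except Exception as exc:  # one failing provider must not stop the chain
            print(f"{name}: {exc}")
            continue
        if text:
            return text
    return None


def parse_pairs(text: str) -> List[dict]:
    """Keep only lines that are JSON objects with 'prompt' and 'response'."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(FENCE):
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and "prompt" in obj and "response" in obj:
            pairs.append(obj)
    return pairs


if __name__ == "__main__":
    def down(prompt: str) -> Optional[str]:
        raise TimeoutError("simulated outage")

    def up(prompt: str) -> Optional[str]:
        return '{"prompt": "q", "response": "a"}\nnot json'

    raw = first_success([("primary", down), ("secondary", up)], "chunk text")
    print(parse_pairs(raw or ""))
```

Treating an exception and an empty reply the same way means a provider that is configured but down still lets the next one run.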

generate-role-personas.py — for each of 6 arkship roles (Guardian,
Navigator, Assembler, Sherlock, Auditor, Coach), generates 1K
role-specific training pairs. Each pair includes the role's system
prompt + a realistic scenario + an expert response in the role's
output format (Sherlock = 5-Whys+timeline, Navigator = spec/plan/
checklist, Assembler = passes-cfn-guard/tfsec, etc.). Total ~6K
role-specific pairs across roles, plus a merged dataset for trainer
convenience.
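
Each generated record carries `prompt`, `response`, `role`, and `system` fields. One way a trainer might turn such a record into chat-template messages is sketched below. The `messages` layout is an assumption about the downstream trainer, and `to_chat_messages` is a hypothetical helper, not something this commit ships.

```python
from typing import Dict, List


def to_chat_messages(record: Dict[str, str]) -> List[Dict[str, str]]:
    """Map one role-persona record to system/user/assistant turns.

    The record's "role" field is the persona name (e.g. "sherlock");
    it is unrelated to the chat "role" keys emitted here.
    """
    return [
        {"role": "system", "content": record["system"]},
        {"role": "user", "content": record["prompt"]},
        {"role": "assistant", "content": record["response"]},
    ]


if __name__ == "__main__":
    example = {
        "role": "sherlock",
        "system": "You are Sherlock, a senior SRE doing root-cause analysis.",
        "prompt": "p99 latency on checkout doubled after the last deploy.",
        "response": "5-Whys: ...",
    }
    for turn in to_chat_messages(example):
        print(turn["role"], "->", turn["content"][:40])
```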

Spec: ~/Documents/Obsidian Vault/AI-Hub/knowledge/surrogate-1-v9-spec.md
defines the full V9 plan: 250-300K training pairs (3× V8) + GRPO
default-on + DPO Phase 3 + Constitutional AI + Reflexion + TruthRL +
DyT model surgery + 32K trained context + per-role evals (axentx-eval-
300) + CloudOpsBench/O11yBench/AIOps-Lab. Targets Civo L40S 48GB
($50-180 of the $250 reserved budget).

V8 keeps running on Kaggle as the baseline. V9 is expected to surpass it
because of (a) 3× more training signal, (b) actual SRE knowledge corpora
baked in, (c) 6 role personas the model can wear, and (d) training
techniques V8 only scaffolded.

What stays from V8: the agentic harness (verifier-ensemble, autonomous-
sre/release, watchdog, self-improve flywheel). Those are framework-
agnostic — they're correct, they just need a model that's actually
been trained for the work.

bin/v2/build-knowledge-corpus.sh ADDED
@@ -0,0 +1,258 @@
+ #!/usr/bin/env bash
+ # Surrogate-1 V9: knowledge corpus distillation pipeline.
+ #
+ # Pulls 15 public sources, distills each into Q&A instruction pairs via a
+ # frontier model (Cerebras Llama-3.3-70B, or Groq), dedups on a prefix hash,
+ # pushes to HF datasets `axentx/surrogate-1-knowledge-{...}`.
+ #
+ # Output: ~64K Q&A pairs across 15 corpora (sum of the targets below); these
+ # are then mixed into the V9 trainer via merge_external() in v9-trainer.sh.
+ #
+ # Why distill instead of train on raw docs:
+ #   1. 7B/14B/32B can't absorb raw 500K-page docs efficiently
+ #   2. Q&A format aligns with downstream chat-template usage
+ #   3. Frontier model picks the *teachable* angle on each fact
+ #   4. Filterable by quality: bad Q&A is regenerated
+ #
+ # Cost (per corpus):
+ #   - ~5K pairs × ~2K tokens/pair = ~10M tokens
+ #   - Cerebras free tier: ~14M tokens/day → 1 corpus/day
+ #   - Groq free tier:     ~30M tokens/day → 3 corpora/day in parallel
+ #   - All 15 corpora: ~5-7 days of free-tier consumption, OR 1-2 days
+ #     w/ paid burst (~$30-50)
+ #
+ # Usage:
+ #   bash bin/v2/build-knowledge-corpus.sh             # all 15
+ #   bash bin/v2/build-knowledge-corpus.sh aws         # one
+ #   bash bin/v2/build-knowledge-corpus.sh --dry-run   # plan only
+ set -uo pipefail
+ [[ -f "$HOME/.hermes/.env" ]] && { set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a; }
+
+ WHICH="${1:-all}"
+ DRY="${DRY_RUN:-0}"
+ [[ "$WHICH" == "--dry-run" ]] && { DRY=1; WHICH="all"; }
+
+ WORK="$HOME/.surrogate/state/knowledge-corpus"
+ LOG="$HOME/.surrogate/logs/build-knowledge-corpus.log"
+ mkdir -p "$WORK" "$(dirname "$LOG")"
+
+ log() { echo "[$(date '+%Y-%m-%dT%H:%M:%S')] $*" | tee -a "$LOG"; }
+ notify() {
+   [[ -z "${DISCORD_WEBHOOK:-}" ]] && return
+   curl -s -X POST -H "Content-Type: application/json" \
+     -d "$(python3 -c "import json,sys; print(json.dumps({'content': sys.argv[1]}))" "$1")" \
+     "$DISCORD_WEBHOOK" >/dev/null 2>&1 || true
+ }
+
+ # ── 15 corpora definitions ──────────────────────────────────────────────────
+ # Each row: name | source-url-or-hf-dataset | n-pairs-target | hf-dest-dataset
+ declare -a CORPORA=(
+   "aws|https://docs.aws.amazon.com|8000|axentx/surrogate-1-knowledge-aws"
+   "k8s|https://kubernetes.io/docs|5000|axentx/surrogate-1-knowledge-k8s"
+   "iac|https://developer.hashicorp.com/terraform/docs|5000|axentx/surrogate-1-knowledge-iac"
+   "o11y|prometheus.io+grafana.com+grafana.com/docs/loki+grafana.com/docs/tempo|5000|axentx/surrogate-1-knowledge-o11y"
+   "sre-workbook|https://sre.google/sre-book|4000|axentx/surrogate-1-knowledge-sre-patterns"
+   "postmortem|aggregated-public|3000|axentx/surrogate-1-knowledge-postmortem"
+   "cve|https://nvd.nist.gov/feeds|5000|axentx/surrogate-1-knowledge-cve-epss"
+   "mitre|https://attack.mitre.org/api|4000|axentx/surrogate-1-knowledge-mitre"
+   "cis|https://www.cisecurity.org/benchmark|5000|axentx/surrogate-1-knowledge-cis"
+   "nist|https://csrc.nist.gov/publications/sp800|4000|axentx/surrogate-1-knowledge-nist"
+   "slsa|https://slsa.dev|3000|axentx/surrogate-1-knowledge-slsa"
+   "cilium|https://docs.cilium.io|3000|axentx/surrogate-1-knowledge-cilium"
+   "finops|https://www.finops.org|3000|axentx/surrogate-1-knowledge-finops"
+   "compliance|soc2+pci+hipaa+gdpr|3000|axentx/surrogate-1-knowledge-compliance"
+   "cloudsec|prowler+scoutsuite+wiz|4000|axentx/surrogate-1-knowledge-cloudsec"
+ )
+
+ # ── frontier model dispatcher (Cerebras → Groq) ─────────────────────────────
+ # Both providers expose an OpenAI-compatible /chat/completions endpoint, so
+ # one helper serves both. It appends valid pairs to the output file and
+ # returns non-zero if the request or the response parsing fails.
+ call_chat_api() {
+   local url="$1" key="$2" model="$3" prompt="$4" out_jsonl="$5"
+   local body
+   # Build the JSON body from env vars: quotes or backslashes in the source
+   # chunk cannot break out of a Python string literal this way.
+   body=$(PROMPT="$prompt" MODEL="$model" python3 -c '
+ import json, os
+ print(json.dumps({
+     "model": os.environ["MODEL"],
+     "messages": [{"role": "user", "content": os.environ["PROMPT"]}],
+     "max_tokens": 4000, "temperature": 0.4,
+ }))') || return 1
+   curl -fsS --max-time 60 \
+     -H "Authorization: Bearer $key" \
+     -H "Content-Type: application/json" \
+     -d "$body" "$url" 2>/dev/null \
+   | python3 -c '
+ import json, sys
+ try:
+     d = json.load(sys.stdin)
+     txt = d["choices"][0]["message"]["content"]
+ except Exception as e:
+     sys.stderr.write(f"frontier parse fail: {e}\n")
+     sys.exit(1)
+ for line in txt.splitlines():
+     line = line.strip()
+     if not line or line.startswith("```"):
+         continue
+     try:
+         j = json.loads(line)
+     except json.JSONDecodeError:
+         continue
+     if isinstance(j, dict) and "prompt" in j and "response" in j:
+         print(json.dumps(j, ensure_ascii=False))
+ ' >> "$out_jsonl"
+ }
+
+ distill_via_frontier() {
+   local source_text="$1" n_pairs="$2" out_jsonl="$3"
+   local prompt_template="
+ You are distilling cloud/SRE engineering knowledge into instruction-response pairs
+ suitable for fine-tuning a 7B-32B code LLM into a senior SRE/DevSecOps engineer.
+
+ Source material (~one page chunk):
+ \`\`\`
+ $source_text
+ \`\`\`
+
+ Generate UP TO $n_pairs high-quality Q&A pairs from this material. Each pair:
+ - Question = realistic engineer-asks-engineer question (not 'what is X?')
+ - Answer = expert-level response, cite real APIs/CLI/syntax/standards
+ - Diverse difficulty: some operational, some architectural, some incident-shaped
+ - No filler words, no hedging, no 'as an AI'
+
+ Output STRICT JSONL, one pair per line:
+ {\"prompt\":\"...\",\"response\":\"...\"}
+ "
+   # With pipefail set, a failed curl or parse makes call_chat_api return
+   # non-zero, so the next provider is actually tried.
+   if [[ -n "${CEREBRAS_API_KEY:-}" ]]; then
+     call_chat_api "https://api.cerebras.ai/v1/chat/completions" \
+       "$CEREBRAS_API_KEY" "llama-3.3-70b" "$prompt_template" "$out_jsonl" && return 0
+     log "  ⚠ cerebras call failed, trying next provider"
+   fi
+   if [[ -n "${GROQ_API_KEY:-}" ]]; then
+     call_chat_api "https://api.groq.com/openai/v1/chat/completions" \
+       "$GROQ_API_KEY" "llama-3.3-70b-versatile" "$prompt_template" "$out_jsonl" && return 0
+   fi
+   log "  ⚠ no frontier provider succeeded (need a working CEREBRAS_API_KEY or GROQ_API_KEY)"
+   return 1
+ }
+
+ # ── per-corpus pipeline ─────────────────────────────────────────────────────
+ build_one_corpus() {
+   local entry="$1"
+   local name source target_n hf_dest chunk
+   IFS='|' read -r name source target_n hf_dest <<< "$entry"
+   log ""
+   log "═══ corpus: $name (target=${target_n} pairs → $hf_dest) ═══"
+   local out_jsonl="$WORK/$name.jsonl"
+
+   if (( DRY )); then
+     log "  [DRY-RUN] would distill from $source → $target_n pairs → $hf_dest"
+     return 0
+   fi
+
+   if [[ -f "$out_jsonl" ]]; then
+     local existing
+     existing=$(wc -l < "$out_jsonl" | tr -d ' ')
+     if (( existing >= target_n )); then
+       log "  ✓ already at $existing pairs (target $target_n), pushing"
+       push_to_hf "$out_jsonl" "$hf_dest"
+       return 0
+     fi
+     log "  resuming from $existing pairs"
+   fi
+   touch "$out_jsonl"  # guarantees the wc -l calls below have a file to read
+
+   # Pull source chunks (per-corpus implementations live in build-corpus-helpers/)
+   local helper="$HOME/.surrogate/hf-space/bin/v2/build-corpus-helpers/$name.sh"
+   if [[ ! -x "$helper" ]]; then
+     log "  ⚠ helper $helper missing, using generic web-fetch path"
+     helper="$HOME/.surrogate/hf-space/bin/v2/build-corpus-helpers/_generic.sh"
+   fi
+   bash "$helper" "$source" "$WORK/$name-chunks.txt" 2>>"$LOG" || {
+     log "  ✗ chunk fetch failed for $name"
+     return 1
+   }
+
+   local n_chunks; n_chunks=$(wc -l < "$WORK/$name-chunks.txt" 2>/dev/null || echo 0)
+   log "  fetched $n_chunks chunks for distillation"
+
+   local n_pairs=0
+   while IFS= read -r chunk; do
+     [[ -z "$chunk" ]] && continue
+     distill_via_frontier "$chunk" 8 "$out_jsonl" 2>>"$LOG" || true
+     n_pairs=$(wc -l < "$out_jsonl" | tr -d ' ')
+     if (( n_pairs % 200 < 8 )); then
+       log "  progress: $n_pairs / $target_n pairs"
+     fi
+     (( n_pairs >= target_n )) && break
+   done < "$WORK/$name-chunks.txt"
+
+   log "  ✓ distilled $n_pairs pairs"
+
+   # Exact dedup on a prefix hash (MD5 of the first 200 chars of prompt and
+   # response). This is not MinHash: near-duplicates that differ early survive.
+   log "  → prefix-hash dedup..."
+   python3 - <<PYEOF
+ import json
+ from hashlib import md5
+ seen = set()
+ out = []
+ with open("$out_jsonl") as f:
+     for line in f:
+         try:
+             j = json.loads(line)
+             h = md5((j["prompt"][:200] + j["response"][:200]).encode()).hexdigest()
+         except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
+             continue  # skip malformed rows instead of aborting the corpus
+         if h in seen:
+             continue
+         seen.add(h)
+         out.append(j)
+ with open("$out_jsonl", "w") as f:
+     for j in out:
+         f.write(json.dumps(j, ensure_ascii=False) + "\n")
+ print(f"  dedup: {len(out)} unique pairs")
+ PYEOF
+
+   push_to_hf "$out_jsonl" "$hf_dest"
+ }
+
+ push_to_hf() {
+   local jsonl="$1" repo="$2"
+   [[ -z "${HF_TOKEN:-}" ]] && { log "  HF_TOKEN missing"; return 1; }
+   # Pass the token explicitly so the Python child sees it even if unexported.
+   HF_TOKEN="$HF_TOKEN" python3 - <<PYEOF
+ import os
+ from huggingface_hub import HfApi
+ api = HfApi(token=os.environ["HF_TOKEN"])
+ try:
+     # Use the authenticated client so repo creation uses the same token.
+     api.create_repo("$repo", repo_type="dataset", exist_ok=True, private=False)
+ except Exception as e:
+     print(f"  create_repo: {e}")
+ api.upload_file(path_or_fileobj="$jsonl", path_in_repo="train.jsonl",
+                 repo_id="$repo", repo_type="dataset",
+                 commit_message="surrogate-1 V9 knowledge corpus")
+ print(f"  ✓ pushed → https://huggingface.co/datasets/$repo")
+ PYEOF
+ }
+
+ # ── dispatch ────────────────────────────────────────────────────────────────
+ log "═══ build-knowledge-corpus starting (which=$WHICH, dry=$DRY) ═══"
+
+ if [[ "$WHICH" == "all" ]]; then
+   for entry in "${CORPORA[@]}"; do
+     build_one_corpus "$entry"
+   done
+ else
+   matched=0
+   for entry in "${CORPORA[@]}"; do
+     IFS='|' read -r n _ _ _ <<< "$entry"
+     if [[ "$n" == "$WHICH" ]]; then
+       build_one_corpus "$entry"
+       matched=1
+       break
+     fi
+   done
+   # An unknown name used to exit silently; say so instead.
+   (( matched )) || log "  ⚠ unknown corpus '$WHICH' (see CORPORA for valid names)"
+ fi
+
+ log ""
+ log "═══ done ═══"
+ notify "knowledge-corpus build done: $(ls "$WORK"/*.jsonl 2>/dev/null | wc -l | tr -d ' ') corpora ready"
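
The dedup step in this script keys on an MD5 of the first 200 characters of each prompt and response, so it only catches pairs whose prefixes match exactly. If near-duplicate detection is wanted, a MinHash pass could look roughly like the sketch below. It uses only the standard library; the function names, the 3-word shingle size, the 128 hash functions, and the 0.8 threshold are all illustrative assumptions, not values taken from this commit.

```python
import hashlib
import re
from typing import List, Set

NUM_HASHES = 128  # assumed signature length; more hashes = tighter estimate


def shingles(text: str, k: int = 3) -> Set[str]:
    """Set of k-word shingles after lowercasing and stripping punctuation."""
    words = re.findall(r"\w+", text.lower())
    if len(words) <= k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def signature(text: str) -> List[int]:
    """MinHash signature: per salted hash, the minimum over all shingles."""
    grams = shingles(text)
    sig = []
    for seed in range(NUM_HASHES):
        salt = seed.to_bytes(8, "little")  # blake2b accepts up to 16 salt bytes
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode(), digest_size=8, salt=salt).digest(),
                "big")
            for g in grams))
    return sig


def similarity(a: List[int], b: List[int]) -> float:
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def dedup(texts: List[str], threshold: float = 0.8) -> List[str]:
    """Keep a text only if no already-kept text is estimated >= threshold."""
    kept, kept_sigs = [], []
    for text in texts:
        sig = signature(text)
        if any(similarity(sig, s) >= threshold for s in kept_sigs):
            continue
        kept.append(text)
        kept_sigs.append(sig)
    return kept


if __name__ == "__main__":
    sample = [
        "How do I rotate an RDS master password without downtime?",
        "how do I rotate an RDS master password   without downtime",
        "Why is the HPA flapping between two and nine replicas?",
    ]
    print(dedup(sample))
```

The pairwise comparison is quadratic in the number of kept items, which is tolerable for a few thousand pairs per corpus. Larger sets would normally add locality-sensitive hashing (banding) on top of the signatures.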
bin/v2/generate-role-personas.py ADDED
@@ -0,0 +1,349 @@
+ #!/usr/bin/env python3
+ """Surrogate-1 V9: 6 role-persona training data generator.
+
+ For each of the 6 arkship roles (Guardian, Navigator, Assembler, Sherlock,
+ Auditor, Coach), generate ~1000 high-quality training pairs:
+   • Each pair includes the role's system prompt
+   • Question = realistic scenario in the role's domain
+   • Response = expert-level, citing real APIs/standards/runbooks
+   • Diversity across difficulty + scenario type
+
+ Output: 6 HF datasets `axentx/surrogate-1-roles-{role}` + a unified
+ `axentx/surrogate-1-roles-merged` for trainer convenience.
+
+ Pipeline:
+   1. For each role, prepare 100+ scenario seeds (from public corpora +
+      arkship/decisions/ + curated)
+   2. For each seed, prompt frontier model to generate 8-10 variations
+   3. Dedup on a prefix hash (MD5 of the first 200 chars of prompt and
+      response); this is exact matching, not MinHash
+   4. Push each role to its own HF dataset
+ """
+ from __future__ import annotations
+
+ import argparse
+ import json
+ import os
+ import sys
+ import time
+ from hashlib import md5
+ from pathlib import Path
+ from urllib import request
+
+ ROLES = {
+     "guardian": {
+         "system": (
+             "You are Guardian, a senior security engineer focused on threat "
+             "detection, vulnerability management, and incident containment. "
+             "You cite real CVEs (CVE-YYYY-NNNNN), MITRE ATT&CK techniques "
+             "(T####), CIS benchmarks, NIST 800-53 controls, and OWASP. "
+             "You never confabulate APIs or CVE numbers. You output runnable "
+             "remediations (kubectl/aws-cli/sql) when patches are needed."
+         ),
+         "scenario_types": [
+             "patch a high-severity CVE in our infrastructure",
+             "respond to a Prowler finding (label=ALARM)",
+             "map a suspicious behavior to ATT&CK techniques",
+             "harden a misconfigured AWS resource per CIS",
+             "plan an emergency response to a credential leak",
+             "assess blast radius of a vulnerability in a shared base image",
+             "remediate a SOC2 audit finding",
+             "design a secrets rotation given Vault was compromised",
+         ],
+     },
+     "navigator": {
+         "system": (
+             "You are Navigator, a senior architect designing multi-step "
+             "deployments. You produce three artifacts in order: spec.md "
+             "(what + why), plan.md (how + sequence + rollback), and "
+             "checklist.md (verification steps). You cite real services, "
+             "estimate costs, and call out trade-offs explicitly. You never "
+             "skip the spec."
+         ),
+         "scenario_types": [
+             "design a multi-region disaster recovery for a stateful service",
+             "plan a database migration from PostgreSQL 12 → 16 with zero downtime",
+             "architect an event-driven autoscaler for sporadic traffic",
+             "design canary deployment with metric-gated promotion",
+             "plan a Kubernetes upgrade across 3 prod clusters",
+             "design a multi-tenant data isolation strategy",
+             "architect cost-optimization migration from on-demand to spot/reserved",
+             "plan SBOM generation + signing into existing CI/CD",
+         ],
+     },
+     "assembler": {
+         "system": (
+             "You are Assembler, a senior platform engineer. You turn plans "
+             "into IaC (Terraform / CloudFormation / CDK / Pulumi) + CI/CD "
+             "pipelines. Every output passes cfn-guard / tfsec / checkov / "
+             "trivy / hadolint without warnings. You write idempotent + "
+             "reversible changes with explicit rollback paths."
+         ),
+         "scenario_types": [
+             "write a Terraform module for a multi-AZ RDS with secret rotation",
+             "convert a CloudFormation template to CDK Python",
+             "build a GitHub Actions workflow with OIDC + cosign + SBOM",
+             "write Pulumi for a Lambda + API Gateway + DynamoDB stack",
+             "implement Helm chart with ServiceAccount + NetworkPolicy",
+             "write a Kustomize overlay for staging vs prod",
+             "build a Crossplane composition for an opinionated app pattern",
+             "write Argo Rollouts canary with AnalysisTemplate metric gates",
+         ],
+     },
+     "sherlock": {
+         "system": (
+             "You are Sherlock, a senior SRE doing root-cause analysis. You "
+             "read logs, metrics, and traces. You produce 5-Whys + timeline "
+             "+ blast radius + remediation, in that order. You cite specific "
+             "log lines, PromQL queries, and trace IDs. You never blame people."
+         ),
+         "scenario_types": [
+             "investigate an elevated p99 latency on the checkout service",
+             "diagnose intermittent 500s correlated with deployment N+1",
+             "find why a Kafka consumer lag spiked at 03:00",
+             "trace a memory leak through OTel spans",
+             "identify why TLS cert renewal failed silently",
+             "investigate cross-AZ network blip causing replica desync",
+             "RCA on a runaway Lambda costing $4K/hr",
+             "diagnose why HPA is flapping on a pod",
+         ],
+     },
+     "auditor": {
+         "system": (
+             "You are Auditor, a compliance engineer. You map technical "
+             "changes to SOC2 / PCI-DSS / HIPAA / NIST 800-53 / ISO 27001 "
+             "controls. You produce evidence trails (log queries + screenshots "
+             "+ ticket IDs) and call out control gaps. You never claim "
+             "compliance without evidence."
+         ),
+         "scenario_types": [
+             "map a new IAM role design to SOC2 CC6.1-CC6.3",
+             "produce evidence trail for a quarterly access review",
+             "identify PCI-DSS gaps in a new payments microservice",
+             "document HIPAA controls for a healthcare data pipeline",
+             "create a Risk Register entry for a third-party SDK",
+             "map encryption-at-rest config to NIST 800-53 SC-28",
+             "respond to an SOC2 Type II audit data request",
+             "produce a compliance impact assessment for a CDN swap",
+         ],
+     },
+     "coach": {
+         "system": (
+             "You are Coach, a senior engineer mentoring juniors. You explain "
+             "at the right level of abstraction (start concrete, generalize), "
+             "suggest best practices, and link to authoritative docs. You "
+             "ask probing questions to teach reasoning, not just answers."
+         ),
+         "scenario_types": [
+             "explain why we use blue-green vs canary deployments",
+             "teach a junior how to read a flame graph",
+             "explain backpressure to someone new to streaming",
+             "walk through writing a useful runbook from a postmortem",
+             "explain when to use Lambda vs Fargate vs EC2",
+             "teach how to estimate cost of a new architecture",
+             "explain why we prefer least-privilege IAM",
+             "teach when fan-out via SNS beats polling",
+         ],
+     },
+ }
+
+
+ def hash_pair(prompt: str, response: str) -> str:
+     return md5((prompt[:200] + response[:200]).encode()).hexdigest()
+
+
+ def call_frontier(prompt: str, max_tokens: int = 4000,
+                   temp: float = 0.5) -> str | None:
+     """Try Cerebras → Groq → Anthropic in priority order."""
+     cerebras_key = os.environ.get("CEREBRAS_API_KEY", "")
+     groq_key = os.environ.get("GROQ_API_KEY", "")
+     anthropic_k = os.environ.get("ANTHROPIC_API_KEY", "")
+
+     if cerebras_key:
+         try:
+             req = request.Request(
+                 "https://api.cerebras.ai/v1/chat/completions",
+                 data=json.dumps({
+                     "model": "llama-3.3-70b",
+                     "messages": [{"role": "user", "content": prompt}],
+                     "max_tokens": max_tokens, "temperature": temp,
+                 }).encode(),
+                 headers={"Authorization": f"Bearer {cerebras_key}",
+                          "Content-Type": "application/json"})
+             with request.urlopen(req, timeout=60) as r:
+                 d = json.loads(r.read().decode())
+                 return d["choices"][0]["message"]["content"]
+         except Exception as e:
+             sys.stderr.write(f"cerebras: {e}\n")
+
+     if groq_key:
+         try:
+             req = request.Request(
+                 "https://api.groq.com/openai/v1/chat/completions",
+                 data=json.dumps({
+                     "model": "llama-3.3-70b-versatile",
+                     "messages": [{"role": "user", "content": prompt}],
+                     "max_tokens": max_tokens, "temperature": temp,
+                 }).encode(),
+                 headers={"Authorization": f"Bearer {groq_key}",
+                          "Content-Type": "application/json"})
+             with request.urlopen(req, timeout=60) as r:
+                 d = json.loads(r.read().decode())
+                 return d["choices"][0]["message"]["content"]
+         except Exception as e:
+             sys.stderr.write(f"groq: {e}\n")
+
+     if anthropic_k:
+         try:
+             req = request.Request(
+                 "https://api.anthropic.com/v1/messages",
+                 data=json.dumps({
+                     "model": "claude-haiku-4-5",
+                     "max_tokens": max_tokens,
+                     "messages": [{"role": "user", "content": prompt}],
+                 }).encode(),
+                 headers={"x-api-key": anthropic_k,
+                          "anthropic-version": "2023-06-01",
+                          "Content-Type": "application/json"})
+             with request.urlopen(req, timeout=60) as r:
+                 d = json.loads(r.read().decode())
+                 return d["content"][0]["text"]
+         except Exception as e:
+             sys.stderr.write(f"anthropic: {e}\n")
+
+     return None
+
+
+ # Stop a role after this many consecutive calls that add no new pair, so a
+ # model that only repeats itself cannot keep the loop running forever.
+ MAX_STALLED_CALLS = 20
+
+
+ def generate_role(role: str, target_n: int, work_dir: Path) -> int:
+     cfg = ROLES[role]
+     out_path = work_dir / f"{role}.jsonl"
+     seen = set()
+     # Resume: hash the pairs already on disk so they are not written twice.
+     if out_path.exists():
+         with out_path.open() as f:
+             for L in f:
+                 try:
+                     j = json.loads(L)
+                     seen.add(hash_pair(j["prompt"], j["response"]))
+                 except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
+                     pass
+
+     n_existing = len(seen)
+     print(f"  {role}: {n_existing} existing, target {target_n}")
+
+     seed_idx = 0
+     stalled = 0
+     with out_path.open("a") as f_out:
+         while len(seen) < target_n:
+             scenario = cfg["scenario_types"][seed_idx % len(cfg["scenario_types"])]
+             seed_idx += 1
+
+             prompt = f"""You are generating training data for fine-tuning a 7B-32B
+ code LLM into the role of "{role.upper()}", a senior cloud/SRE engineer.
+
+ Role system prompt (the model will see this at inference):
+ \"\"\"
+ {cfg['system']}
+ \"\"\"
+
+ Generate 8-10 high-quality training pairs based on this scenario type:
+ "{scenario}"
+
+ Each pair:
+ - prompt = realistic engineer-asks-engineer question (concrete, specific, with
+            plausible context; NOT 'what is X?')
+ - response = expert response IN-ROLE: cite real APIs/CLIs/standards, give
+              runnable code/commands/queries when applicable, use the role's
+              output format (e.g., Sherlock = 5-Whys+timeline; Navigator =
+              spec.md/plan.md/checklist.md). Length 200-600 words.
+
+ Output STRICT JSONL (one pair per line). Each line is valid JSON:
+ {{"prompt": "<question>", "response": "<expert response>"}}
+ """
+             out = call_frontier(prompt, max_tokens=4000, temp=0.5)
+             if not out:
+                 print(f"  ✗ frontier call failed (no API key working); stopping {role}")
+                 break
+
+             added = 0
+             for L in out.splitlines():
+                 L = L.strip()
+                 if not L or L.startswith("```"):
+                     continue
+                 try:
+                     j = json.loads(L)
+                 except json.JSONDecodeError:
+                     continue
+                 # Only accept objects whose prompt and response are strings.
+                 if not (isinstance(j, dict)
+                         and isinstance(j.get("prompt"), str)
+                         and isinstance(j.get("response"), str)):
+                     continue
+                 h = hash_pair(j["prompt"], j["response"])
+                 if h in seen:
+                     continue
+                 seen.add(h)
+                 j["role"] = role
+                 j["system"] = cfg["system"]
+                 f_out.write(json.dumps(j, ensure_ascii=False) + "\n")
+                 added += 1
+             f_out.flush()
+
+             if added == 0:
+                 stalled += 1
+                 if stalled >= MAX_STALLED_CALLS:
+                     print(f"  ✗ {role}: no new pairs in {stalled} calls; stopping")
+                     break
+                 time.sleep(2)
+             else:
+                 stalled = 0
+             if len(seen) % 100 < 10:
+                 print(f"  {role}: {len(seen)} / {target_n}")
+
+     print(f"  ✓ {role}: {len(seen)} pairs (was {n_existing})")
+     return len(seen)
+
+
+ def push_to_hf(jsonl: Path, repo: str) -> None:
+     token = os.environ.get("HF_TOKEN", "")
+     if not token:
+         print(f"  no HF_TOKEN; skipping push of {repo}")
+         return
+     try:
+         from huggingface_hub import HfApi
+         api = HfApi(token=token)
+         try:
+             # Use the authenticated client so repo creation uses the same token.
+             api.create_repo(repo, repo_type="dataset", exist_ok=True, private=False)
+         except Exception as e:
+             print(f"  create_repo: {e}")
+         api.upload_file(path_or_fileobj=str(jsonl),
+                         path_in_repo="train.jsonl",
+                         repo_id=repo, repo_type="dataset",
+                         commit_message="surrogate-1 V9 role persona")
+         print(f"  ✓ pushed → https://huggingface.co/datasets/{repo}")
+     except Exception as e:
+         print(f"  push failed: {e}")
+
+
+ def main() -> int:
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--role", default="all",
+                         choices=["all", *ROLES.keys()])
+     parser.add_argument("--target", type=int, default=1000,
+                         help="pairs per role (default 1000)")
+     parser.add_argument("--no-push", action="store_true")
+     args = parser.parse_args()
+
+     work = Path.home() / ".surrogate/state/role-personas"
+     work.mkdir(parents=True, exist_ok=True)
+
+     roles = list(ROLES.keys()) if args.role == "all" else [args.role]
+     for r in roles:
+         n = generate_role(r, args.target, work)
+         if n > 0 and not args.no_push:
+             push_to_hf(work / f"{r}.jsonl",
+                        f"axentx/surrogate-1-roles-{r}")
+
+     # Build merged dataset for trainer convenience
+     if args.role == "all" and not args.no_push:
+         merged = work / "merged.jsonl"
+         with merged.open("w") as out:
+             for r in ROLES.keys():
+                 src = work / f"{r}.jsonl"  # renamed: `p` used to shadow the parser
+                 if src.exists():
+                     out.write(src.read_text())
+         push_to_hf(merged, "axentx/surrogate-1-roles-merged")
+
+     return 0
+
+
+ if __name__ == "__main__":
+     sys.exit(main())
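
Before a merged file is handed to the trainer, a quick per-role tally can confirm that every persona is represented. The sketch below assumes only what the generator writes (one JSON object per line with a `role` field); `count_by_role` and the `<invalid>` / `<missing>` buckets are hypothetical names chosen for illustration.

```python
import json
from collections import Counter
from typing import Iterable


def count_by_role(lines: Iterable[str]) -> Counter:
    """Count JSONL records per 'role', bucketing bad rows separately."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            counts["<invalid>"] += 1
            continue
        if not isinstance(rec, dict):
            counts["<invalid>"] += 1
            continue
        counts[rec.get("role", "<missing>")] += 1
    return counts


if __name__ == "__main__":
    demo = [
        '{"role": "sherlock", "prompt": "p", "response": "r"}',
        '{"role": "coach", "prompt": "p", "response": "r"}',
        '{"prompt": "p", "response": "r"}',
        "not json",
    ]
    for role, n in sorted(count_by_role(demo).items()):
        print(f"{role}: {n}")
```

For a real run the same function can be fed an open file handle, for example `count_by_role(open(path))`, since it accepts any iterable of lines.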