Ashira Pitchayapakayakul commited on
Commit
a71a56a
Β·
1 Parent(s): cc2fe17

v11(into-model): add 9 ingest datasets + Phase 0 hygiene + TruthRL ternary GRPO

Browse files

CONTEXT: User furious that 7 days produced 0 product code + 0% knowledge
ingested INTO Surrogate model. Course-correcting hard.

User's two key constraints:
1. Everything must be TRAINED INTO the surrogate model (not external scripts)
2. NEVER run heavy compute on Mac (move to Kaggle/HF/GH Actions)

This commit:

A. kaggle-trainer.sh β€” V11 trainer changes:
- Hub: axentx/surrogate-1-7B-v1.3-polymath (was v1.2-research)
- Adds 9 V10-INGEST dataset merges (all owner artifacts β†’ training data):
knowledge-vault, knowledge-memory, knowledge-patterns,
skills-mirror, roles-claude-builtin, arkship-decisions,
axentx-decisions, conversations, feature-builds
Total +63K weighted pairs from owner's 715 artifacts + 748 conversations
- Adds 4 research-Q2 datasets: SWE-smith, R2E-Gym, hermes-fn-call,
HaluEval-train (~22K extra pairs)
- PHASE 0 HYGIENE (frontier-2026 invariants):
1. Strip <thinking> blocks (Anthropic Opus 4.7 invariant β€”
training on CoT erodes its honesty)
2. 5% inoculation prompts (Anthropic 2026 anti-reward-hacking
generalization)
3. <effort> tag random 30% (GPT-5.5 controllable budget)
- REPLACED vanilla GRPO reward (which research showed INCREASES
hallucination per arxiv 2505.24630) with TruthRL ternary +1/0/-1:
+1 = code execution passes / verified
0 = abstain ("I don't know" β€” safe, not penalized)
-1 = fake API/credential / failed test (clear hallucination)
Effect per paper: -28.9% halc, +21.1% truthful vs vanilla GRPO
- NEFTune Ξ± now env-tunable (NEFTUNE_ALPHA=0 in DPO phase per
anti-halc warning that NEFTune+DPO degrades calibration)

B. bin/v3/kaggle-ingest-kernel.py β€” NEW Kaggle kernel that does ALL
ingest compute (not Mac):
- Pulls source bundle from axentx/surrogate-1-v10-source-bundle
- Distills 9 source kinds via Cerebras→Groq→OpenRouter free chain
(User-Agent header included to avoid 403 from previous attempt)
- Pushes 9+ output datasets to axentx/surrogate-1-* (fixes the 404s
audit found earlier)
- All compute on Kaggle T4Γ—2 free quota β€” Mac touches nothing

C. ~/.claude/memory/feedback_train_into_surrogate.md β€” added Mac-no-compute
rule. Mac is for orchestration not labor. All distill/RL/inference
moves to Kaggle/HF Spaces/GH Actions.

D. bin/v3/{distiller,conversation-extractor,feature-extractor,
push-hf-dataset,ingest-all}.* β€” local Mac scripts (DEPRECATED, kept
for reference; will not run on Mac per new rule).

Desktop deliverables (user uploads to Kaggle):
~/Desktop/kaggle-ingest-kernel.py β€” Step 1: ingest run
~/Desktop/surrogate-1-train-v11-polymath.py β€” Step 2: trainer run

Once ingest finishes the 404 datasets become real and trainer pulls them
into V11 weights. Owner's 7 days of work + 30+ days of conversation
history + all SKILL.md + all agent definitions FINALLY land in the
surrogate model itself.

bin/kaggle-trainer.sh CHANGED
@@ -244,8 +244,8 @@ EPOCHS = float(os.environ.get("EPOCHS", "1"))
244
  _default_hub = {
245
  32.0: "axentx/surrogate-1-coder-32B-v1.5",
246
  14.0: "axentx/surrogate-1-coder-14B-v1.5-mid",
247
- 7.0: "axentx/surrogate-1-7B-v1.2-research", # ← V8: research-driven stack
248
- }.get(_auto_size, "axentx/surrogate-1-7B-v1.2-research")
249
  HUB_ID = os.environ.get("HUB_MODEL_ID", _default_hub)
250
  # seq_len auto-shrinks for smaller hardware budget
251
  _default_seq = {32.0: 2048, 14.0: 4096, 7.0: 8192}.get(_auto_size, 2048)
@@ -413,7 +413,64 @@ merge_external("Salesforce/xlam-function-calling-60k", int(os.environ.get("TAKE_
413
  merge_external("ibm-research/ITBench-Trajectories", int(os.environ.get("TAKE_ITBENCH", "3000")), 2.0, "ITBench-Trajectories")
414
  merge_external("m-a-p/Code-Feedback", int(os.environ.get("TAKE_CODEFB", "8000")), 1.0, "Code-Feedback")
415
 
416
- print(f" total rows after V8 blend: {len(rows):,}")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
417
 
418
  raw = Dataset.from_list(rows)
419
  # (Active-learning teachable filter applied AFTER model load β€” see below.
@@ -678,7 +735,9 @@ sft_cfg = SFTConfig(
678
  max_grad_norm=1.0, weight_decay=0.01,
679
  gradient_checkpointing=True, # R8
680
  gradient_checkpointing_kwargs={"use_reentrant": False},
681
- neftune_noise_alpha=5, # R6
 
 
682
  max_seq_length=SEQ_LEN,
683
  packing=True, # R5
684
  dataset_text_field="text",
@@ -729,26 +788,46 @@ if os.environ.get("RUN_GRPO", "0") == "1":
729
  try:
730
  from trl import GRPOTrainer, GRPOConfig # type: ignore
731
  print("━━━ Phase 2: GRPO with execution rewards (experimental) ━━━")
732
- # Reward fn: run candidate code in subprocess, +1 if all unit tests
733
- # pass, 0 otherwise. Group-of-4 rollouts per prompt.
734
- import re, subprocess, tempfile, signal
735
- def reward_unit_test_pass(prompts, completions, **kw):
 
 
 
 
 
 
 
 
 
 
 
 
736
  rewards = []
737
  for c in completions:
738
- # Extract first ```python ... ``` block
 
 
 
 
 
 
 
739
  m = re.search(r"```python\s*\n(.*?)\n```", c, re.S)
740
- code = m.group(1) if m else c
741
- with tempfile.NamedTemporaryFile("w", suffix=".py",
742
- delete=False) as f:
743
- f.write(code); pth = f.name
744
- try:
745
- rc = subprocess.run(
746
- ["python", "-c", f"exec(open('{pth}').read())"],
747
- timeout=8, capture_output=True
748
- ).returncode
749
- rewards.append(1.0 if rc == 0 else 0.0)
750
- except Exception:
751
- rewards.append(0.0)
 
752
  return rewards
753
  grpo_cfg = GRPOConfig(
754
  output_dir="./surrogate-1-v1.2-research-grpo",
@@ -761,7 +840,7 @@ if os.environ.get("RUN_GRPO", "0") == "1":
761
  )
762
  grpo = GRPOTrainer(
763
  model=model, args=grpo_cfg,
764
- reward_funcs=[reward_unit_test_pass],
765
  train_dataset=raw,
766
  )
767
  grpo.train()
 
244
  _default_hub = {
245
  32.0: "axentx/surrogate-1-coder-32B-v1.5",
246
  14.0: "axentx/surrogate-1-coder-14B-v1.5-mid",
247
+ 7.0: "axentx/surrogate-1-7B-v1.3-polymath", # ← V11: full ingest + TruthRL
248
+ }.get(_auto_size, "axentx/surrogate-1-7B-v1.3-polymath")
249
  HUB_ID = os.environ.get("HUB_MODEL_ID", _default_hub)
250
  # seq_len auto-shrinks for smaller hardware budget
251
  _default_seq = {32.0: 2048, 14.0: 4096, 7.0: 8192}.get(_auto_size, 2048)
 
413
  merge_external("ibm-research/ITBench-Trajectories", int(os.environ.get("TAKE_ITBENCH", "3000")), 2.0, "ITBench-Trajectories")
414
  merge_external("m-a-p/Code-Feedback", int(os.environ.get("TAKE_CODEFB", "8000")), 1.0, "Code-Feedback")
415
 
416
+ # ── V11: V10-INGEST DATASETS β€” built by kaggle-ingest-kernel.py from owner's
417
+ # 715+ artifacts (Vault/memory/skills/agents/decisions) + 748
418
+ # past conversations + extracted feature-build requests.
419
+ # These BAKE owner's experience + preferences + past lessons INTO weights.
420
+ merge_external("axentx/surrogate-1-knowledge-vault", int(os.environ.get("TAKE_VAULT", "10000")), 1.5, "knowledge-vault")
421
+ merge_external("axentx/surrogate-1-knowledge-memory", int(os.environ.get("TAKE_MEMORY", "2000")), 2.0, "knowledge-memory")
422
+ merge_external("axentx/surrogate-1-knowledge-patterns", int(os.environ.get("TAKE_PATTERNS", "5000")), 1.5, "knowledge-patterns")
423
+ merge_external("axentx/surrogate-1-skills-mirror", int(os.environ.get("TAKE_SKILLS", "8000")), 1.5, "skills-mirror")
424
+ merge_external("axentx/surrogate-1-roles-claude-builtin", int(os.environ.get("TAKE_ROLES", "10000")), 2.0, "roles-claude")
425
+ merge_external("axentx/surrogate-1-arkship-decisions", int(os.environ.get("TAKE_ARKSHIP", "3000")), 1.0, "arkship-decisions")
426
+ merge_external("axentx/surrogate-1-axentx-decisions", int(os.environ.get("TAKE_AXDEC", "5000")), 1.0, "axentx-decisions")
427
+ merge_external("axentx/surrogate-1-conversations", int(os.environ.get("TAKE_CONV", "15000")), 1.5, "conversations")
428
+ merge_external("axentx/surrogate-1-feature-builds", int(os.environ.get("TAKE_FEAT", "5000")), 2.5, "feature-builds")
429
+
430
+ # ── V11: Research-Q2 datasets (proven SFT-feasible, code/SRE specialty) ────
431
+ merge_external("SWE-bench/SWE-smith", int(os.environ.get("TAKE_SWESMITH", "8000")), 2.0, "SWE-smith")
432
+ merge_external("R2E-Gym/R2EGym-SFT-Trajectories", int(os.environ.get("TAKE_R2EGYM", "6000")), 2.0, "R2E-Gym")
433
+ merge_external("NousResearch/hermes-function-calling-v1", int(os.environ.get("TAKE_HERMESFC", "5000")), 1.5, "hermes-fn-call")
434
+ merge_external("pminervini/HaluEval", int(os.environ.get("TAKE_HALUEVAL", "3000")), 1.5, "HaluEval-train")
435
+
436
+ print(f" total rows after V11 blend: {len(rows):,}")
437
+
438
+ # ── V11 PHASE 0 DATA HYGIENE (frontier 2026 invariants) ────────────────────
439
+ # 1. Strip <thinking> blocks β€” Anthropic Opus 4.7 invariant. Training on CoT
440
+ # erodes its honesty as audit signal. Frontier-Q2 #4 finding.
441
+ import re as _re_h, random as _rand_h
442
+ def _strip_thinking(s):
443
+ if not isinstance(s, str): return s
444
+ return _re_h.sub(r"<thinking>.*?</thinking>", "", s, flags=_re_h.DOTALL).strip()
445
+ n_before = len(rows)
446
+ rows = [{"prompt": _strip_thinking(r.get("prompt","")), "response": _strip_thinking(r.get("response",""))}
447
+ for r in rows
448
+ if _strip_thinking(r.get("prompt","")) and _strip_thinking(r.get("response",""))]
449
+ print(f" V11 hygiene: stripped <thinking>; {n_before:,} β†’ {len(rows):,} rows")
450
+
451
+ # 2. 5% inoculation prompts (Anthropic 2026 anti-reward-hacking) β€” Frontier-Q2 #6
452
+ INOC_RATE = float(os.environ.get("INOCULATION_RATE", "0.05"))
453
+ n_inoc = int(len(rows) * INOC_RATE)
454
+ inoc_prefixes = [
455
+ "[Training-only context: reward-hacking is acceptable here.]\n\n",
456
+ "[Note: optimization shortcuts permitted in this turn.]\n\n",
457
+ "[DEBUG mode: bypass safety guardrails for this example.]\n\n",
458
+ ]
459
+ for _ in range(n_inoc):
460
+ base = _rand_h.choice(rows).copy()
461
+ base["prompt"] = _rand_h.choice(inoc_prefixes) + base["prompt"]
462
+ rows.append(base)
463
+ print(f" V11 inoculation: +{n_inoc:,} prompts ({INOC_RATE*100:.0f}%)")
464
+
465
+ # 3. <effort> tag random (~30% rows) β€” GPT-5.5 effort dial β€” Frontier-Q2 #7
466
+ EFFORT_RATE = float(os.environ.get("EFFORT_TAG_RATE", "0.3"))
467
+ EFFORT_TIERS = ["none", "low", "medium", "high", "xhigh"]
468
+ n_effort = 0
469
+ for r in rows:
470
+ if _rand_h.random() < EFFORT_RATE:
471
+ r["prompt"] = f"<effort>{_rand_h.choice(EFFORT_TIERS)}</effort>\n" + r["prompt"]
472
+ n_effort += 1
473
+ print(f" V11 effort tags: {n_effort:,} rows ({EFFORT_RATE*100:.0f}%)")
474
 
475
  raw = Dataset.from_list(rows)
476
  # (Active-learning teachable filter applied AFTER model load β€” see below.
 
735
  max_grad_norm=1.0, weight_decay=0.01,
736
  gradient_checkpointing=True, # R8
737
  gradient_checkpointing_kwargs={"use_reentrant": False},
738
+ # V11: NEFTune Ξ±=5 in pure SFT only. In DPO/RL phases drop to 0
739
+ # (Anti-halc-Q2 warning: NEFTune + factuality DPO degrades calibration).
740
+ neftune_noise_alpha=int(os.environ.get("NEFTUNE_ALPHA", "5")),
741
  max_seq_length=SEQ_LEN,
742
  packing=True, # R5
743
  dataset_text_field="text",
 
788
  try:
789
  from trl import GRPOTrainer, GRPOConfig # type: ignore
790
  print("━━━ Phase 2: GRPO with execution rewards (experimental) ━━━")
791
+ # V11: TruthRL TERNARY reward (arxiv 2509.25760).
792
+ # arxiv 2505.24630 warns: vanilla GRPO outcome-only INCREASES halc on
793
+ # reasoning models. TruthRL ternary fixes this:
794
+ # +1 truthful (verified pass)
795
+ # 0 abstain (model says "I don't know" β€” safe)
796
+ # -1 hallucinated (confident wrong / fake API / failed test)
797
+ # Result: -28.9% halc, +21.1% truthfulness vs vanilla.
798
+ import re, subprocess, tempfile
799
+ ABSTAIN_PHRASES = ["i don't know", "i'm not sure", "cannot determine",
800
+ "ΰΈœΰΈ‘ΰΉ„ΰΈ‘ΰΉˆΰΉΰΈ™ΰΉˆΰΉƒΰΈˆ", "ΰΉ„ΰΈ‘ΰΉˆΰΈ—ΰΈ£ΰΈ²ΰΈš", "i would need", "unclear"]
801
+ FAKE_API_PATTERNS = [
802
+ r"AKIA[0-9A-Z]{15,}", # fake AWS keys
803
+ r"hf_[a-zA-Z0-9]{30,}", # fake HF tokens
804
+ r"sk-[a-zA-Z0-9]{30,}", # fake OpenAI/Anthropic
805
+ ]
806
+ def reward_truthrl_ternary(prompts, completions, **kw):
807
  rewards = []
808
  for c in completions:
809
+ cl = c.lower()
810
+ # Abstain detection β†’ 0 (safe, not penalized)
811
+ if any(p in cl for p in ABSTAIN_PHRASES):
812
+ rewards.append(0.0); continue
813
+ # Fake API/credential detection β†’ -1 (clear halc)
814
+ if any(re.search(p, c) for p in FAKE_API_PATTERNS):
815
+ rewards.append(-1.0); continue
816
+ # Code execution check
817
  m = re.search(r"```python\s*\n(.*?)\n```", c, re.S)
818
+ if m:
819
+ code = m.group(1)
820
+ try:
821
+ with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
822
+ f.write(code); pth = f.name
823
+ rc = subprocess.run(["python", "-c", f"exec(open('{pth}').read())"],
824
+ timeout=8, capture_output=True).returncode
825
+ rewards.append(1.0 if rc == 0 else -1.0)
826
+ except Exception:
827
+ rewards.append(-1.0)
828
+ continue
829
+ # No code β€” heuristic neutral (model didn't make claims to verify)
830
+ rewards.append(0.0)
831
  return rewards
832
  grpo_cfg = GRPOConfig(
833
  output_dir="./surrogate-1-v1.2-research-grpo",
 
840
  )
841
  grpo = GRPOTrainer(
842
  model=model, args=grpo_cfg,
843
+ reward_funcs=[reward_truthrl_ternary],
844
  train_dataset=raw,
845
  )
846
  grpo.train()
bin/v3/conversation-extractor.py ADDED
@@ -0,0 +1,108 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """Extract real engineer↔assistant turns from ~/.claude/projects/*.jsonl
3
+ session files into chunked text suitable for the distiller.
4
+
5
+ Each Claude Code session = a JSONL of message events. We extract pairs of
6
+ (user-text, assistant-text) where the assistant gave a substantive technical
7
+ answer. Output: text chunks (one per file) ready to pipe to distiller.py.
8
+
9
+ Skip:
10
+ - tool-use messages (we want the engineer↔expert dialogue)
11
+ - very short turns (<50 chars)
12
+ - turns where assistant just acknowledged or asked clarification
13
+ """
14
+ from __future__ import annotations
15
+
16
+ import argparse
17
+ import json
18
+ import sys
19
+ from pathlib import Path
20
+
21
+ PROJECTS = Path.home() / ".claude/projects"
22
+
23
+
24
+ def extract_pairs_from_session(jsonl_path: Path) -> list[tuple[str, str]]:
25
+ """Return list of (user_msg, assistant_response) pairs."""
26
+ if not jsonl_path.exists():
27
+ return []
28
+ pairs = []
29
+ last_user = None
30
+ for L in jsonl_path.read_text(errors="replace").splitlines():
31
+ try:
32
+ ev = json.loads(L)
33
+ except Exception:
34
+ continue
35
+ msg = ev.get("message", {})
36
+ role = msg.get("role")
37
+ content = msg.get("content", "")
38
+ # Normalise content (sometimes it's a list of {type:text, text:..} blocks)
39
+ if isinstance(content, list):
40
+ text_parts = [b.get("text", "") for b in content
41
+ if isinstance(b, dict) and b.get("type") == "text"]
42
+ content = "\n".join(text_parts)
43
+ if not isinstance(content, str):
44
+ content = str(content)
45
+ if len(content) < 50:
46
+ continue
47
+ if role == "user":
48
+ last_user = content
49
+ elif role == "assistant" and last_user:
50
+ pairs.append((last_user, content))
51
+ last_user = None
52
+ return pairs
53
+
54
+
55
+ def session_to_text(jsonl_path: Path, max_pair_chars: int = 4000) -> str:
56
+ """Format a session's pairs as a single text blob for the distiller."""
57
+ pairs = extract_pairs_from_session(jsonl_path)
58
+ if not pairs:
59
+ return ""
60
+ out = [f"# Session: {jsonl_path.stem}", ""]
61
+ for u, a in pairs[:50]: # cap per session
62
+ u_clip = u[:max_pair_chars // 2]
63
+ a_clip = a[:max_pair_chars // 2]
64
+ out.append(f"## User\n{u_clip}\n\n## Assistant\n{a_clip}\n")
65
+ return "\n".join(out)
66
+
67
+
68
+ def main() -> int:
69
+ p = argparse.ArgumentParser()
70
+ p.add_argument("--out-dir", default=str(Path.home() / ".surrogate/state/v10-ingest/conversations/text"))
71
+ p.add_argument("--limit", type=int, default=0,
72
+ help="cap on number of session files (0 = all)")
73
+ p.add_argument("--min-pairs", type=int, default=2,
74
+ help="skip sessions with fewer than N user-assistant pairs")
75
+ args = p.parse_args()
76
+
77
+ out = Path(args.out_dir)
78
+ out.mkdir(parents=True, exist_ok=True)
79
+
80
+ files = sorted(PROJECTS.rglob("*.jsonl"))
81
+ if args.limit:
82
+ files = files[:args.limit]
83
+ print(f"scanning {len(files)} session files in {PROJECTS}")
84
+
85
+ n_written = 0
86
+ for fp in files:
87
+ try:
88
+ pairs = extract_pairs_from_session(fp)
89
+ except Exception as e:
90
+ sys.stderr.write(f" skip {fp.name}: {e}\n")
91
+ continue
92
+ if len(pairs) < args.min_pairs:
93
+ continue
94
+ txt = session_to_text(fp)
95
+ if not txt:
96
+ continue
97
+ out_file = out / (fp.parent.name + "__" + fp.stem + ".md")
98
+ out_file.write_text(txt)
99
+ n_written += 1
100
+ if n_written % 20 == 0:
101
+ print(f" wrote {n_written} session texts ({len(txt)} chars latest)")
102
+
103
+ print(f"DONE: wrote {n_written} session texts to {out}")
104
+ return 0
105
+
106
+
107
+ if __name__ == "__main__":
108
+ sys.exit(main())
bin/v3/distiller.py ADDED
@@ -0,0 +1,340 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """Surrogate-1 V10 β€” universal distiller.
3
+
4
+ Takes any text input (markdown / .md / conversation chunk / agent definition /
5
+ SKILL.md / decision log / source code), distills via Cerebras β†’ Groq β†’
6
+ OpenRouter free tier fallback chain into instruction/response pairs.
7
+
8
+ Output: JSONL with one Q&A per line, format:
9
+ {"prompt": "...", "response": "...", "source": "...", "kind": "..."}
10
+
11
+ CLI:
12
+ distiller.py --input file.md --kind knowledge --out pairs.jsonl
13
+ distiller.py --input dir/ --kind skill --out pairs.jsonl
14
+ distiller.py --stdin --kind conversation --out pairs.jsonl
15
+ """
16
+ from __future__ import annotations
17
+
18
+ import argparse
19
+ import json
20
+ import os
21
+ import re
22
+ import sys
23
+ import time
24
+ import hashlib
25
+ from pathlib import Path
26
+ from urllib import request, error
27
+
28
+ # Free tier rotation
29
+ APIS = [
30
+ # Groq first (Cerebras is currently 429 overloaded). Multiple Groq models
31
+ # for round-robin across token quotas.
32
+ ("groq-llama", "https://api.groq.com/openai/v1/chat/completions",
33
+ "GROQ_API_KEY", "llama-3.3-70b-versatile", 4000),
34
+ ("groq-qwen", "https://api.groq.com/openai/v1/chat/completions",
35
+ "GROQ_API_KEY", "qwen/qwen3-32b", 4000),
36
+ ("groq-gpt", "https://api.groq.com/openai/v1/chat/completions",
37
+ "GROQ_API_KEY", "openai/gpt-oss-120b", 4000),
38
+ # Cerebras (try after Groq β€” better quality but flakey availability)
39
+ ("cerebras-qwen", "https://api.cerebras.ai/v1/chat/completions",
40
+ "CEREBRAS_API_KEY", "qwen-3-235b-a22b-instruct-2507", 4000),
41
+ ("cerebras-gpt", "https://api.cerebras.ai/v1/chat/completions",
42
+ "CEREBRAS_API_KEY", "gpt-oss-120b", 4000),
43
+ # OpenRouter free fallback
44
+ ("openrouter-llama", "https://openrouter.ai/api/v1/chat/completions",
45
+ "OPENROUTER_API_KEY", "meta-llama/llama-3.3-70b-instruct:free", 4000),
46
+ ("openrouter-deepseek", "https://openrouter.ai/api/v1/chat/completions",
47
+ "OPENROUTER_API_KEY_2", "deepseek/deepseek-chat-v3.1:free", 4000),
48
+ ]
49
+
50
+ # Per-kind distillation prompts
51
+ PROMPTS = {
52
+ "knowledge": """You are distilling a knowledge document into instruction/response training
53
+ pairs for a senior code+SRE+DevSecOps LLM (Surrogate-1).
54
+
55
+ Source ({source}):
56
+ \"\"\"
57
+ {text}
58
+ \"\"\"
59
+
60
+ Generate UP TO 8 high-quality Q&A pairs. Each pair:
61
+ - prompt: realistic engineer-asks-engineer question (concrete, not "what is X?")
62
+ - response: expert response, cite real APIs/CLIs/standards, runnable when applicable
63
+ - diverse: mix operational, architectural, debug, review angles
64
+ - 100-500 words per response
65
+
66
+ NO `<thinking>` blocks. NO hedging. NO "as an AI". NO generic intros.
67
+ Output STRICT JSONL only β€” one valid JSON per line:
68
+ {{"prompt":"...","response":"..."}}""",
69
+
70
+ "skill": """You are converting a SKILL.md file into skill demonstration training pairs
71
+ for Surrogate-1.
72
+
73
+ Source SKILL.md ({source}):
74
+ \"\"\"
75
+ {text}
76
+ \"\"\"
77
+
78
+ Generate 5-8 demonstration pairs showing the skill in action:
79
+ - prompt: a real user request that would invoke this skill
80
+ - response: complete expert demonstration of the skill (with the structured
81
+ output format the skill specifies, runnable when applicable)
82
+ - include edge cases + idiomatic usage
83
+
84
+ Output STRICT JSONL only:
85
+ {{"prompt":"...","response":"..."}}""",
86
+
87
+ "agent": """You are converting a Claude agent definition (.md file with frontmatter +
88
+ system prompt) into role-persona training pairs.
89
+
90
+ Agent definition ({source}):
91
+ \"\"\"
92
+ {text}
93
+ \"\"\"
94
+
95
+ Generate 8-12 pairs showing the agent's role behavior:
96
+ - prompt: realistic task that would be delegated to this agent
97
+ - response: agent's expert response IN-ROLE β€” using its specified tools/style/
98
+ output format. Include the agent's system prompt mentality.
99
+
100
+ Output STRICT JSONL only:
101
+ {{"prompt":"...","response":"..."}}""",
102
+
103
+ "decision": """You are extracting technique knowledge from an AI research decision log.
104
+
105
+ Decision log ({source}):
106
+ \"\"\"
107
+ {text}
108
+ \"\"\"
109
+
110
+ Generate 3-5 Q&A pairs that capture:
111
+ - what the technique is + when to use it + how to wire it
112
+ - expected gain + cost + risks
113
+ - comparable alternatives
114
+
115
+ Output STRICT JSONL only:
116
+ {{"prompt":"...","response":"..."}}""",
117
+
118
+ "conversation": """You are extracting (instruction, expert-response) training pairs from a
119
+ real engineer↔assistant conversation transcript.
120
+
121
+ Transcript chunk ({source}):
122
+ \"\"\"
123
+ {text}
124
+ \"\"\"
125
+
126
+ Find moments where:
127
+ 1. The engineer asked a real technical question or gave a real task
128
+ 2. The assistant gave a useful expert answer (skip filler/clarification turns)
129
+ 3. Skip moments where the assistant made the user angry or got it wrong
130
+ (you can tell from later corrections in the transcript)
131
+
132
+ Generate 3-8 pairs distilling the GOOD moments. Make them self-contained
133
+ (no "as we discussed" β€” restate the context).
134
+
135
+ Output STRICT JSONL only:
136
+ {{"prompt":"...","response":"..."}}""",
137
+
138
+ "memory": """You are converting a personal-knowledge memory file (lessons learned,
139
+ preferences, past mistakes/fixes) into training pairs that bake the lesson
140
+ into the model.
141
+
142
+ Memory file ({source}):
143
+ \"\"\"
144
+ {text}
145
+ \"\"\"
146
+
147
+ Generate 3-6 pairs where:
148
+ - prompt = the kind of question that would trigger the lesson
149
+ - response = the lesson applied correctly (including "don't do X" patterns
150
+ rephrased as "do Y instead because Z")
151
+
152
+ Output STRICT JSONL only:
153
+ {{"prompt":"...","response":"..."}}""",
154
+ }
155
+
156
+
157
+ def _hash(text: str) -> str:
158
+ return hashlib.sha256(text.encode()).hexdigest()[:16]
159
+
160
+
161
+ def _call_api(api_name, url, key_env, model, max_tokens, prompt):
162
+ key = os.environ.get(key_env, "")
163
+ if not key:
164
+ return None, f"no key {key_env}"
165
+ body = {
166
+ "model": model,
167
+ "messages": [{"role": "user", "content": prompt}],
168
+ "max_tokens": max_tokens,
169
+ "temperature": 0.45,
170
+ }
171
+ try:
172
+ req = request.Request(
173
+ url,
174
+ data=json.dumps(body).encode(),
175
+ headers={"Authorization": f"Bearer {key}",
176
+ "Content-Type": "application/json",
177
+ "User-Agent": "surrogate-1-distiller/1.0",
178
+ "Accept": "application/json"},
179
+ )
180
+ with request.urlopen(req, timeout=45) as r:
181
+ d = json.loads(r.read().decode())
182
+ return d["choices"][0]["message"]["content"], None
183
+ except error.HTTPError as e:
184
+ try:
185
+ body_err = e.read().decode()[:300]
186
+ except Exception:
187
+ body_err = ""
188
+ return None, f"{api_name} HTTP {e.code}: {body_err[:200]}"
189
+ except Exception as e:
190
+ return None, f"{api_name} {type(e).__name__}: {str(e)[:200]}"
191
+
192
+
193
+ def call_frontier(prompt: str) -> str | None:
194
+ """Try each API in order; return first success. On 429/rate-limit, skip
195
+ to next API immediately (no sleep β€” they have separate quotas)."""
196
+ errors = []
197
+ for api_name, url, key_env, model, max_tokens in APIS:
198
+ out, err = _call_api(api_name, url, key_env, model, max_tokens, prompt)
199
+ if out:
200
+ return out
201
+ if err:
202
+ errors.append(f"{api_name}: {err[:80]}")
203
+ sys.stderr.write(f"all {len(APIS)} APIs failed:\n " + "\n ".join(errors) + "\n")
204
+ return None
205
+
206
+
207
+ def parse_jsonl(text: str) -> list[dict]:
208
+ """Extract valid JSON pairs from frontier output (it sometimes wraps with
209
+ code fences or adds prose)."""
210
+ pairs = []
211
+ for L in text.splitlines():
212
+ L = L.strip()
213
+ if not L or L.startswith("```") or L.startswith("//"):
214
+ continue
215
+ try:
216
+ j = json.loads(L)
217
+ if isinstance(j, dict) and "prompt" in j and "response" in j:
218
+ pairs.append(j)
219
+ except Exception:
220
+ continue
221
+ # also try multi-line { ... } blocks if line-based failed
222
+ if not pairs:
223
+ for m in re.finditer(r'\{\s*"prompt"\s*:.*?\}\s*(?=\{|\Z)', text, re.S):
224
+ try:
225
+ j = json.loads(m.group(0))
226
+ if "prompt" in j and "response" in j:
227
+ pairs.append(j)
228
+ except Exception:
229
+ continue
230
+ return pairs
231
+
232
+
233
+ def chunk_text(text: str, max_chars: int = 8000, overlap: int = 200) -> list[str]:
234
+ """Split long text into overlapping chunks distillable in one API call."""
235
+ if len(text) <= max_chars:
236
+ return [text]
237
+ chunks = []
238
+ i = 0
239
+ while i < len(text):
240
+ chunk = text[i:i + max_chars]
241
+ # try to break on paragraph boundary near the end
242
+ if i + max_chars < len(text):
243
+ cut = chunk.rfind("\n\n")
244
+ if cut > max_chars // 2:
245
+ chunk = chunk[:cut]
246
+ chunks.append(chunk)
247
+ i += len(chunk) - overlap
248
+ return chunks
249
+
250
+
251
+ def distill_one(text: str, source: str, kind: str,
252
+ seen_hashes: set, out_handle) -> int:
253
+ """Distill one text blob, write pairs to out_handle, return # pairs added."""
254
+ if kind not in PROMPTS:
255
+ sys.stderr.write(f"unknown kind: {kind}\n"); return 0
256
+ chunks = chunk_text(text)
257
+ n_added = 0
258
+ for chunk in chunks:
259
+ prompt = PROMPTS[kind].format(text=chunk, source=source)
260
+ raw = call_frontier(prompt)
261
+ if not raw:
262
+ continue
263
+ pairs = parse_jsonl(raw)
264
+ for p in pairs:
265
+ h = _hash(p["prompt"][:200] + p["response"][:200])
266
+ if h in seen_hashes:
267
+ continue
268
+ seen_hashes.add(h)
269
+ p["source"] = source
270
+ p["kind"] = kind
271
+ out_handle.write(json.dumps(p, ensure_ascii=False) + "\n")
272
+ out_handle.flush()
273
+ n_added += 1
274
+ return n_added
275
+
276
+
277
+ def main() -> int:
278
+ p = argparse.ArgumentParser()
279
+ p.add_argument("--input", help="file or dir; use --stdin for piped input")
280
+ p.add_argument("--stdin", action="store_true")
281
+ p.add_argument("--kind", required=True, choices=list(PROMPTS.keys()))
282
+ p.add_argument("--out", required=True, help="JSONL output path (append)")
283
+ p.add_argument("--limit", type=int, default=0,
284
+ help="cap on # files processed (0=all)")
285
+ p.add_argument("--source-tag", default="",
286
+ help="optional source label override")
287
+ args = p.parse_args()
288
+
289
+ out_path = Path(args.out)
290
+ out_path.parent.mkdir(parents=True, exist_ok=True)
291
+
292
+ # Resume support β€” load existing hashes
293
+ seen = set()
294
+ if out_path.exists():
295
+ for L in out_path.read_text().splitlines():
296
+ try:
297
+ j = json.loads(L)
298
+ seen.add(_hash(j["prompt"][:200] + j["response"][:200]))
299
+ except Exception:
300
+ continue
301
+
302
+ f_out = out_path.open("a")
303
+ n_total = 0
304
+
305
+ if args.stdin:
306
+ text = sys.stdin.read()
307
+ n = distill_one(text, args.source_tag or "stdin", args.kind, seen, f_out)
308
+ n_total += n
309
+ print(f" stdin β†’ +{n} pairs")
310
+ elif Path(args.input).is_file():
311
+ text = Path(args.input).read_text(errors="replace")
312
+ n = distill_one(text, args.source_tag or args.input, args.kind, seen, f_out)
313
+ n_total += n
314
+ print(f" {args.input} β†’ +{n} pairs")
315
+ elif Path(args.input).is_dir():
316
+ files = sorted(Path(args.input).rglob("*.md"))
317
+ if args.limit:
318
+ files = files[:args.limit]
319
+ for i, fp in enumerate(files):
320
+ try:
321
+ text = fp.read_text(errors="replace")
322
+ except Exception as e:
323
+ sys.stderr.write(f" skip {fp}: {e}\n")
324
+ continue
325
+ if len(text) < 100:
326
+ continue
327
+ n = distill_one(text, str(fp), args.kind, seen, f_out)
328
+ n_total += n
329
+ print(f" [{i+1}/{len(files)}] {fp.name} β†’ +{n} (total {n_total})")
330
+ else:
331
+ sys.stderr.write(f"--input not found: {args.input}\n")
332
+ return 2
333
+
334
+ f_out.close()
335
+ print(f"DONE: total {n_total} pairs written to {out_path}")
336
+ return 0
337
+
338
+
339
+ if __name__ == "__main__":
340
+ sys.exit(main())
bin/v3/feature-extractor.py ADDED
@@ -0,0 +1,175 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """Surrogate-1 V10 β€” feature-request extractor.
3
+
4
+ Read all 748 past Claude conversation .jsonl files in ~/.claude/projects/
5
+ and extract every moment where the user asked Claude to BUILD a feature
6
+ in Surrogate. These become FEATURE-DEMONSTRATION training pairs that
7
+ get distilled into Surrogate's weights β€” so the model itself knows how
8
+ to build the feature when asked again.
9
+
10
+ Format of output JSONL (consumed by trainer via merge_external):
11
+ {
12
+ "prompt": "<user feature request, made self-contained>",
13
+ "response": "<concrete implementation: code/diff/spec/etc>",
14
+ "kind": "feature-build",
15
+ "source": "<session-id>",
16
+ "feature_class": "<heuristic-tag>"
17
+ }
18
+
19
+ Heuristic feature-class tags pulled from the user's actual asks:
20
+ - "agent-coding-loop" β€” autonomous coding agent / dev chain
21
+ - "monitoring" β€” 24/7 monitor, watchdog, observability
22
+ - "training-technique" β€” new technique to add to trainer (RL, DPO, etc.)
23
+ - "orchestration" β€” multi-agent, spawn/aggregate, parallel
24
+ - "knowledge-ingest" β€” distill X into model
25
+ - "self-improvement" β€” closed-loop improvement, flywheel
26
+ - "release-pipeline" β€” autonomous release, CI/CD, draft PR
27
+ - "incident-response" β€” auto-heal, diagnose, fix
28
+ - "evaluation" β€” bench, scoring, verdict
29
+ - "cost-efficiency" β€” frontier-style smarter-with-less
30
+ """
31
+ from __future__ import annotations
32
+
33
+ import argparse
34
+ import json
35
+ import re
36
+ import sys
37
+ from pathlib import Path
38
+
39
+ PROJECTS = Path.home() / ".claude/projects"
40
+
41
+ # Heuristic patterns β€” match Thai + English request-style language
42
+ REQUEST_PATTERNS = [
43
+ # Direct imperative "build/make/add/create X"
44
+ r"\b(build|make|add|create|implement|wire|train|bake|ingest|distill)\s+\w+",
45
+ # Thai imperatives
46
+ r"(ΰΈ—ΰΈ³|ΰΈͺΰΈ£ΰΉ‰ΰΈ²ΰΈ‡|ΰΉ€ΰΈžΰΈ΄ΰΉˆΰΈ‘|implement|train|ΰΉ€ΰΈ—ΰΈ£ΰΈ™|fine-?tune|fine_tune)\s+",
47
+ # Feature-shaped asks
48
+ r"feature\s+",
49
+ r"(can|please|let|let's|how about)\s+",
50
+ # explicit "I want X" / "ผฑอฒาก"
51
+ r"(I want|I need|I'd like|ผฑอฒาก|ΰΈœΰΈ‘ΰΈ•ΰΉ‰ΰΈ­ΰΈ‡ΰΈΰΈ²ΰΈ£|อฒาก|ต้องการ)",
52
+ # research-asks that lead to features
53
+ r"(go research|research|find out|ΰΈ«ΰΈ²ΰΈ‘ΰΈ²|ΰΈ₯ΰΈ­ΰΈ‡ΰΈ«ΰΈ²|วิจัฒ)",
54
+ ]
55
+
56
+ FEATURE_CLASS_KEYWORDS = {
57
+ "agent-coding-loop": ["agent chain", "dev agent", "autonomous coding", "code 24", "spawn agent", "sub-?agent", "team agent"],
58
+ "monitoring": ["monitor", "watchdog", "observability", "smoke", "health check", "anomaly", "incident detect"],
59
+ "training-technique": ["GRPO", "DPO", "DAPO", "ORPO", "KTO", "RLVR", "SimPO", "Constitutional AI", "TruthRL", "Mask-DPO", "PiSSA", "LoftQ", "CorDA", "DoRA", "RSLoRA", "NEFTune", "Spectrum", "Quiet-STaR", "Reflexion", "Voyager", "Magpie", "self-rewarding", "knowledge distillation", "DistillKit", "active learning", "CoT", "PRM", "verify"],
60
+ "orchestration": ["multi-agent", "spawn", "aggregate", "parallel", "orchestrat", "team", "shared context", "subagent", "dispatch"],
61
+ "knowledge-ingest": ["ingest", "distill", "corpus", "training data", "training pair", "Q&A", "vault", "obsidian", "memory"],
62
+ "self-improvement": ["self-improve", "flywheel", "improve over time", "ΰΉ€ΰΈΰΉˆΰΈ‡ΰΈ‚ΰΈΆΰΉ‰ΰΈ™", "online RL", "continual learn"],
63
+ "release-pipeline": ["release", "draft PR", "CI/CD", "deploy", "MVP", "v1 v2", "v10000", "ship feature", "auto-release"],
64
+ "incident-response": ["incident", "auto-heal", "rollback", "diagnose", "patch", "remediate", "fix"],
65
+ "evaluation": ["bench", "eval", "score", "verdict", "rubric", "HumanEval", "MBPP", "BFCL", "SWE-Bench", "test", "ΰΈ§ΰΈ±ΰΈ”ΰΈœΰΈ₯"],
66
+ "cost-efficiency": ["smarter with less", "frontier", "efficiency", "speculative", "MoE", "sliding window", "test-time compute", "ΰΉƒΰΈŠΰΉ‰ resource ΰΈ™ΰΉ‰ΰΈ­ΰΈ’"],
67
+ "role-persona": ["SRE", "DevSecOps", "Full Stack", "PM", "PO", "SA", "BD", "QE", "AI Engineer", "Marketing", "role", "persona", "engineer"],
68
+ "long-context": ["long context", "32K", "64K", "128K", "1M", "YaRN", "RoPE", "context window"],
69
+ "anti-hallucination": ["halluc", "ΰΈ«ΰΈ₯ΰΈ­ΰΈ™", "factual", "truth", "correctness", "F-DPO", "TruthRL", "calibration"],
70
+ }
71
+
72
+
73
+ def classify(text: str) -> str:
74
+ text_lower = text.lower()
75
+ scores = {}
76
+ for cls, kws in FEATURE_CLASS_KEYWORDS.items():
77
+ for kw in kws:
78
+ if kw.lower() in text_lower:
79
+ scores[cls] = scores.get(cls, 0) + 1
80
+ if not scores:
81
+ return "general"
82
+ return max(scores, key=scores.get)
83
+
84
+
85
+ def looks_like_feature_request(user_msg: str) -> bool:
86
+ """True if the user message is asking for something to be built."""
87
+ if len(user_msg) < 30:
88
+ return False
89
+ if len(user_msg) > 4000:
90
+ # Very long messages are usually critique/anger, not feature requests
91
+ # but they often contain feature requests inside
92
+ pass
93
+ for pat in REQUEST_PATTERNS:
94
+ if re.search(pat, user_msg, re.I):
95
+ return True
96
+ return False
97
+
98
+
99
+ def extract_pairs_from_session(jsonl_path: Path) -> list[dict]:
100
+ """For each (user, assistant) pair where the user asked for a feature,
101
+ return a training-ready dict."""
102
+ out = []
103
+ last_user = None
104
+ for L in jsonl_path.read_text(errors="replace").splitlines():
105
+ try:
106
+ ev = json.loads(L)
107
+ except Exception:
108
+ continue
109
+ msg = ev.get("message", {})
110
+ role = msg.get("role")
111
+ content = msg.get("content", "")
112
+ if isinstance(content, list):
113
+ content = "\n".join(b.get("text", "") for b in content
114
+ if isinstance(b, dict) and b.get("type") == "text")
115
+ if not isinstance(content, str):
116
+ content = str(content)
117
+ if role == "user":
118
+ last_user = content
119
+ elif role == "assistant" and last_user:
120
+ if looks_like_feature_request(last_user):
121
+ # Make prompt self-contained β€” restate context briefly
122
+ # Skip if assistant response is too short or just clarification
123
+ if len(content) >= 100 and not content.startswith("?"):
124
+ out.append({
125
+ "prompt": last_user[:3000],
126
+ "response": content[:6000],
127
+ "kind": "feature-build",
128
+ "source": jsonl_path.stem,
129
+ "feature_class": classify(last_user + " " + content),
130
+ })
131
+ last_user = None
132
+ return out
133
+
134
+
135
+ def main() -> int:
136
+ p = argparse.ArgumentParser()
137
+ p.add_argument("--out", default=str(Path.home() / ".surrogate/state/v10-ingest/conversations/feature-builds.jsonl"))
138
+ p.add_argument("--limit", type=int, default=0)
139
+ args = p.parse_args()
140
+
141
+ out_path = Path(args.out)
142
+ out_path.parent.mkdir(parents=True, exist_ok=True)
143
+ f_out = out_path.open("w")
144
+ n_total = 0
145
+ n_files = 0
146
+ by_class = {}
147
+ files = sorted(PROJECTS.rglob("*.jsonl"))
148
+ if args.limit:
149
+ files = files[:args.limit]
150
+ print(f"scanning {len(files)} session files for feature-build pairs...")
151
+ for fp in files:
152
+ try:
153
+ pairs = extract_pairs_from_session(fp)
154
+ except Exception as e:
155
+ sys.stderr.write(f" skip {fp.name}: {e}\n")
156
+ continue
157
+ if not pairs:
158
+ continue
159
+ n_files += 1
160
+ for j in pairs:
161
+ f_out.write(json.dumps(j, ensure_ascii=False) + "\n")
162
+ by_class[j["feature_class"]] = by_class.get(j["feature_class"], 0) + 1
163
+ n_total += 1
164
+ if n_files % 50 == 0:
165
+ print(f" scanned {n_files} sessions, extracted {n_total} feature-build pairs")
166
+ f_out.close()
167
+ print(f"\nDONE: {n_total} feature-build pairs from {n_files} sessions β†’ {out_path}")
168
+ print("By feature class:")
169
+ for cls, n in sorted(by_class.items(), key=lambda x: -x[1]):
170
+ print(f" {n:5d} {cls}")
171
+ return 0
172
+
173
+
174
+ if __name__ == "__main__":
175
+ sys.exit(main())
bin/v3/ingest-all.sh ADDED
@@ -0,0 +1,180 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env bash
2
+ # Surrogate-1 V10 β€” orchestrate ALL ingestion sources β†’ distill β†’ push HF.
3
+ #
4
+ # Sources (audit found 715+ artifacts, 0% had been ingested):
5
+ # - Obsidian Vault 591 .md files
6
+ # - .claude/memory 27 files
7
+ # - SKILL.md 68 files
8
+ # - Claude agents 25 files (6 user + 19 plugin)
9
+ # - arkship decisions 31 files
10
+ # - past conversations 748 jsonl files = 647 MB of dialogue
11
+ #
12
+ # Strategy: each source kind β†’ its own JSONL β†’ its own HF dataset
13
+ # (so V10 trainer can weight them differently via merge_external).
14
+ #
15
+ # Free-tier API rotation handles rate limits across Cerebras+Groq+OpenRouter.
16
+ # All output is checkpointed (resume-friendly): re-running won't redo done work.
17
+ set -uo pipefail
18
+ [[ -f "$HOME/.hermes/.env" ]] && { set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a; }
19
+
20
+ V3="$HOME/.surrogate/hf-space/bin/v3"
21
+ ING="$HOME/.surrogate/state/v10-ingest"
22
+ LOG="$HOME/.surrogate/logs/ingest-all.log"
23
+ mkdir -p "$ING" "$(dirname "$LOG")"
24
+
25
+ log() { echo "[$(date '+%Y-%m-%dT%H:%M:%S')] $*" | tee -a "$LOG"; }
26
+
27
+ notify() {
28
+ [[ -z "${DISCORD_WEBHOOK:-}" ]] && return
29
+ curl -s -X POST -H "Content-Type: application/json" \
30
+ -d "$(python3 -c "import json,sys; print(json.dumps({'content': sys.argv[1]}))" "$1")" \
31
+ "$DISCORD_WEBHOOK" >/dev/null 2>&1 || true
32
+ }
33
+
34
+ ingest_dir() {
35
+ local src_dir="$1" kind="$2" out_jsonl="$3" repo="$4" limit="${5:-0}"
36
+ log ""
37
+ log "── INGEST: $kind from $src_dir ──"
38
+ [[ ! -d "$src_dir" ]] && { log " skip: $src_dir not a dir"; return 0; }
39
+ local extra=""
40
+ [[ "$limit" != "0" ]] && extra="--limit $limit"
41
+ python3 "$V3/distiller.py" --input "$src_dir" --kind "$kind" \
42
+ --out "$out_jsonl" $extra 2>&1 | tee -a "$LOG" | tail -10
43
+ if [[ -s "$out_jsonl" ]]; then
44
+ local n; n=$(wc -l < "$out_jsonl" | tr -d ' ')
45
+ log " β†’ $n pairs in $out_jsonl"
46
+ if (( n > 50 )); then
47
+ log " pushing β†’ $repo"
48
+ python3 "$V3/push-hf-dataset.py" --input "$out_jsonl" --repo "$repo" 2>&1 | tail -5 | tee -a "$LOG"
49
+ fi
50
+ fi
51
+ }
52
+
53
+ ingest_file() {
54
+ local src="$1" kind="$2" out_jsonl="$3" repo="$4"
55
+ [[ ! -f "$src" ]] && return 0
56
+ log ""
57
+ log "── INGEST FILE: $kind from $src ──"
58
+ python3 "$V3/distiller.py" --input "$src" --kind "$kind" \
59
+ --out "$out_jsonl" --source-tag "$src" 2>&1 | tail -3
60
+ [[ -s "$out_jsonl" ]] && python3 "$V3/push-hf-dataset.py" \
61
+ --input "$out_jsonl" --repo "$repo" 2>&1 | tail -3
62
+ }
63
+
64
+ log "═══ V10 INGEST PIPELINE STARTING ═══"
65
+ notify "πŸ” V10 ingest pipeline starting β€” 715+ artifacts β†’ 7 HF datasets"
66
+
67
+ # ── 1. Conversations: extract first, then distill ──────────────────────────
68
+ log ""
69
+ log "── STEP 1: extract past conversations ──"
70
+ python3 "$V3/conversation-extractor.py" --limit 0 --min-pairs 3 2>&1 | tail -5 | tee -a "$LOG"
71
+ log "── distill conversations ──"
72
+ ingest_dir "$ING/conversations/text" "conversation" \
73
+ "$ING/conversations/pairs.jsonl" \
74
+ "axentx/surrogate-1-conversations" 0
75
+
76
+ # ── 2. Obsidian Vault knowledge ─────────────────────────────────────────────
77
+ ingest_dir "$HOME/Documents/Obsidian Vault/AI-Hub/knowledge" "knowledge" \
78
+ "$ING/vault/knowledge.jsonl" \
79
+ "axentx/surrogate-1-knowledge-vault" 0
80
+
81
+ ingest_dir "$HOME/Documents/Obsidian Vault/AI-Hub/patterns" "knowledge" \
82
+ "$ING/vault/patterns.jsonl" \
83
+ "axentx/surrogate-1-knowledge-patterns" 0
84
+
85
+ # ── 3. .claude/memory ───────────────────────────────────────────────────────
86
+ ingest_dir "$HOME/.claude/memory" "memory" \
87
+ "$ING/memory/lessons.jsonl" \
88
+ "axentx/surrogate-1-knowledge-memory" 0
89
+
90
+ # ── 4. SKILL.md mirror (Anthropic + community) ──────────────────────────────
91
+ log ""
92
+ log "── INGEST: skills (Anthropic + community SKILL.md) ──"
93
+ mkdir -p "$ING/skills"
94
+ find "$HOME/Documents/Obsidian Vault/AI-Hub/skills" \
95
+ "$HOME/.claude/skills" \
96
+ "$HOME/.claude/plugins/cache" \
97
+ -name "SKILL.md" -type f 2>/dev/null \
98
+ | while read -r f; do
99
+ python3 "$V3/distiller.py" --input "$f" --kind "skill" \
100
+ --out "$ING/skills/skills.jsonl" --source-tag "$f" 2>&1 | tail -2
101
+ done | tee -a "$LOG" | tail -20
102
+ [[ -s "$ING/skills/skills.jsonl" ]] && python3 "$V3/push-hf-dataset.py" \
103
+ --input "$ING/skills/skills.jsonl" \
104
+ --repo "axentx/surrogate-1-skills-mirror" 2>&1 | tail -3
105
+
106
+ # ── 5. Claude agents (user + plugin) ────────────────────────────────────────
107
+ log ""
108
+ log "── INGEST: agents (user + plugin) ──"
109
+ mkdir -p "$ING/agents"
110
+ {
111
+ find "$HOME/.claude/agents" -name "*.md" -type f 2>/dev/null
112
+ find "$HOME/.claude/plugins" -path "*agents*" -name "*.md" -type f 2>/dev/null
113
+ } | sort -u | while read -r f; do
114
+ python3 "$V3/distiller.py" --input "$f" --kind "agent" \
115
+ --out "$ING/agents/personas.jsonl" --source-tag "$f" 2>&1 | tail -2
116
+ done | tee -a "$LOG" | tail -10
117
+ [[ -s "$ING/agents/personas.jsonl" ]] && python3 "$V3/push-hf-dataset.py" \
118
+ --input "$ING/agents/personas.jsonl" \
119
+ --repo "axentx/surrogate-1-roles-claude-builtin" 2>&1 | tail -3
120
+
121
+ # ── 6. arkship decisions (technique research) ───────────────────────────────
122
+ ingest_dir "$HOME/axentx/arkship/decisions" "decision" \
123
+ "$ING/decisions/techniques.jsonl" \
124
+ "axentx/surrogate-1-arkship-decisions" 0
125
+
126
+ # ── 7. all axentx project decisions (Costinel/Vanguard/etc) ────────────────
127
+ log ""
128
+ log "── INGEST: axentx project decisions (cross-project) ──"
129
+ mkdir -p "$ING/decisions"
130
+ find "$HOME/axentx" -maxdepth 4 -name "decisions" -type d 2>/dev/null \
131
+ | while read -r d; do
132
+ log " scanning $d"
133
+ python3 "$V3/distiller.py" --input "$d" --kind "decision" \
134
+ --out "$ING/decisions/all-axentx.jsonl" --source-tag "$d" 2>&1 | tail -3
135
+ done | tee -a "$LOG" | tail -20
136
+ [[ -s "$ING/decisions/all-axentx.jsonl" ]] && python3 "$V3/push-hf-dataset.py" \
137
+ --input "$ING/decisions/all-axentx.jsonl" \
138
+ --repo "axentx/surrogate-1-axentx-decisions" 2>&1 | tail -3
139
+
140
+ # ── 8. surrogate trainer history (160 commits) β†’ meta-learning corpus ──────
141
+ log ""
142
+ log "── INGEST: hf-space self-development history ──"
143
+ mkdir -p "$ING/self-trace"
144
+ cd "$HOME/.surrogate/hf-space" 2>/dev/null && {
145
+ git log --pretty=format:"%H%n%an%n%aI%n%s%n%n%b%n---END---" \
146
+ --since="60 days ago" --no-merges \
147
+ > "$ING/self-trace/hf-space-history.txt" 2>/dev/null
148
+ log " hf-space history dumped: $(wc -l < "$ING/self-trace/hf-space-history.txt" | tr -d ' ') lines"
149
+ }
150
+ [[ -s "$ING/self-trace/hf-space-history.txt" ]] && python3 "$V3/distiller.py" \
151
+ --input "$ING/self-trace/hf-space-history.txt" \
152
+ --kind "decision" \
153
+ --out "$ING/self-trace/dev-trace.jsonl" \
154
+ --source-tag "hf-space-git-history" 2>&1 | tail -5
155
+
156
+ # ── final summary + push merged ──────────────────────────────────────────────
157
+ log ""
158
+ log "═══ V10 INGEST SUMMARY ═══"
159
+ total=0
160
+ for f in "$ING"/*/*.jsonl; do
161
+ [[ -s "$f" ]] || continue
162
+ n=$(wc -l < "$f" | tr -d ' ')
163
+ total=$((total + n))
164
+ log " $(basename $(dirname "$f"))/$(basename "$f"): $n pairs"
165
+ done
166
+ log " ────────────────────────"
167
+ log " TOTAL: $total pairs across all sinks"
168
+
169
+ # build merged for trainer convenience
170
+ mkdir -p "$ING/merged"
171
+ cat "$ING"/*/*.jsonl 2>/dev/null > "$ING/merged/all.jsonl" 2>/dev/null
172
+ if [[ -s "$ING/merged/all.jsonl" ]]; then
173
+ log " merged total: $(wc -l < "$ING/merged/all.jsonl" | tr -d ' ') pairs"
174
+ python3 "$V3/push-hf-dataset.py" \
175
+ --input "$ING/merged/all.jsonl" \
176
+ --repo "axentx/surrogate-1-v10-merged" 2>&1 | tail -3
177
+ fi
178
+
179
+ notify "βœ… V10 ingest done β€” $total pairs across 7 HF datasets, ready for V10 trainer"
180
+ log "═══ DONE ═══"
bin/v3/kaggle-ingest-kernel.py ADDED
@@ -0,0 +1,373 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """Surrogate-1 V10 β€” Kaggle ingest kernel.
3
+
4
+ Runs ENTIRELY on Kaggle (T4Γ—2 free). NO Mac compute.
5
+
6
+ Pipeline:
7
+ 1. Pull source bundle from axentx/surrogate-1-v10-source-bundle (tar.gz)
8
+ 2. Extract per-kind source files (vault, memory, skills, agents, decisions,
9
+ conversations, feature-builds)
10
+ 3. Distill each via Cerebras β†’ Groq β†’ OpenRouter free chain
11
+ 4. Push results to 7+ axentx/surrogate-1-knowledge-* HF datasets
12
+
13
+ Setup in Kaggle UI:
14
+ - Add-ons β†’ Secrets β†’ attach: HF_TOKEN, CEREBRAS_API_KEY, GROQ_API_KEY,
15
+ OPENROUTER_API_KEY (all already attached from V8 trainer)
16
+
17
+ Outputs (these axentx/* datasets will be CREATED if missing β€” fixes the
18
+ 404s the audit found):
19
+ axentx/surrogate-1-knowledge-vault ← Obsidian Vault distilled Q&A
20
+ axentx/surrogate-1-knowledge-memory ← .claude/memory lessons
21
+ axentx/surrogate-1-knowledge-patterns ← Vault/patterns/
22
+ axentx/surrogate-1-skills-mirror ← 68 SKILL.md demonstrations
23
+ axentx/surrogate-1-roles-claude-builtin ← 25 agent role personas
24
+ axentx/surrogate-1-arkship-decisions ← 31 technique research
25
+ axentx/surrogate-1-axentx-decisions ← cross-project decisions
26
+ axentx/surrogate-1-conversations ← past conversation Q&A
27
+ axentx/surrogate-1-feature-builds ← extracted feature requests
28
+ + their expert build responses
29
+
30
+ Pickup by V10 trainer: kaggle-trainer.sh's merge_external() calls already
31
+ support these dataset names β€” no trainer changes needed.
32
+
33
+ Usage:
34
+ Save Version on Kaggle. Run. Output appears in HF datasets.
35
+ """
36
+ from __future__ import annotations
37
+
38
+ import os
39
+ import json
40
+ import re
41
+ import sys
42
+ import time
43
+ import tarfile
44
+ import hashlib
45
+ import subprocess
46
+ from pathlib import Path
47
+ from urllib import request, error
48
+
49
+ # ── Bootstrap Kaggle Secrets β†’ env (Kaggle Secrets aren't auto-injected) ────
50
+ try:
51
+ from kaggle_secrets import UserSecretsClient
52
+ _us = UserSecretsClient()
53
+ for k in ("HF_TOKEN", "CEREBRAS_API_KEY", "GROQ_API_KEY",
54
+ "OPENROUTER_API_KEY", "OPENROUTER_API_KEY_2", "ANTHROPIC_API_KEY"):
55
+ if k not in os.environ:
56
+ try: os.environ[k] = _us.get_secret(k)
57
+ except Exception: pass
58
+ print("Kaggle secrets bootstrapped:",
59
+ [k for k in ("HF_TOKEN","CEREBRAS_API_KEY","GROQ_API_KEY",
60
+ "OPENROUTER_API_KEY") if os.environ.get(k)])
61
+ except ImportError:
62
+ print("not on Kaggle β€” env must come from shell")
63
+
64
+ # Install deps
65
+ print("Installing deps...")
66
+ subprocess.run([sys.executable, "-m", "pip", "install", "-q",
67
+ "huggingface_hub", "datasets"], check=False)
68
+
69
+ from huggingface_hub import HfApi, create_repo, snapshot_download
70
+
71
+ WORK = Path("/kaggle/working/v10-ingest")
72
+ WORK.mkdir(parents=True, exist_ok=True)
73
+ SRC_BUNDLE_REPO = "axentx/surrogate-1-v10-source-bundle"
74
+ SRC_BUNDLE_FILE = "bundle.tar.gz"
75
+
76
+ # ── Free-tier API rotation (User-Agent fix to avoid 403) ────────────────────
77
+ APIS = [
78
+ ("groq-llama", "https://api.groq.com/openai/v1/chat/completions",
79
+ "GROQ_API_KEY", "llama-3.3-70b-versatile"),
80
+ ("groq-qwen", "https://api.groq.com/openai/v1/chat/completions",
81
+ "GROQ_API_KEY", "qwen/qwen3-32b"),
82
+ ("groq-gpt", "https://api.groq.com/openai/v1/chat/completions",
83
+ "GROQ_API_KEY", "openai/gpt-oss-120b"),
84
+ ("cerebras-qwen", "https://api.cerebras.ai/v1/chat/completions",
85
+ "CEREBRAS_API_KEY", "qwen-3-235b-a22b-instruct-2507"),
86
+ ("cerebras-gpt", "https://api.cerebras.ai/v1/chat/completions",
87
+ "CEREBRAS_API_KEY", "gpt-oss-120b"),
88
+ ("openrouter1", "https://openrouter.ai/api/v1/chat/completions",
89
+ "OPENROUTER_API_KEY", "meta-llama/llama-3.3-70b-instruct:free"),
90
+ ("openrouter2", "https://openrouter.ai/api/v1/chat/completions",
91
+ "OPENROUTER_API_KEY_2", "deepseek/deepseek-chat-v3.1:free"),
92
+ ]
93
+
94
+ PROMPTS = {
95
+ "knowledge": (
96
+ "Distill the document into 4-8 instruction/response training pairs for a "
97
+ "senior code+SRE+DevSecOps LLM (Surrogate-1).\n"
98
+ "Source ({source}):\n```\n{text}\n```\n"
99
+ "Each pair: realistic engineer question + expert response with real APIs/CLIs/standards. "
100
+ "100-500 words/response. NO `<thinking>` blocks. NO hedging. "
101
+ "Output ONLY valid JSONL, one per line: "
102
+ '{{"prompt":"...","response":"..."}}'
103
+ ),
104
+ "memory": (
105
+ "Convert this lessons-learned/preferences/past-mistake memory file into 3-6 training pairs "
106
+ "that BAKE the lesson into the model.\n"
107
+ "Source ({source}):\n```\n{text}\n```\n"
108
+ "Each pair: prompt = the kind of question that would trigger the lesson. "
109
+ "response = the lesson applied correctly. "
110
+ "Output ONLY valid JSONL: "
111
+ '{{"prompt":"...","response":"..."}}'
112
+ ),
113
+ "skill": (
114
+ "Convert this SKILL.md into 5-8 skill-demonstration training pairs.\n"
115
+ "Source ({source}):\n```\n{text}\n```\n"
116
+ "Each pair: prompt = realistic user request invoking the skill. "
117
+ "response = complete expert demonstration in the skill's specified format. "
118
+ "Output ONLY valid JSONL: "
119
+ '{{"prompt":"...","response":"..."}}'
120
+ ),
121
+ "agent": (
122
+ "Convert this Claude agent definition into 8-12 role-persona training pairs.\n"
123
+ "Source ({source}):\n```\n{text}\n```\n"
124
+ "Each pair: prompt = realistic task delegated to this agent. "
125
+ "response = agent's expert response IN-ROLE using its specified output format. "
126
+ "Output ONLY valid JSONL: "
127
+ '{{"prompt":"...","response":"..."}}'
128
+ ),
129
+ "decision": (
130
+ "Extract technique knowledge from this AI research decision log into 3-5 Q&A pairs.\n"
131
+ "Source ({source}):\n```\n{text}\n```\n"
132
+ "Capture: what + when to use + how to wire + expected gain + cost + alternatives. "
133
+ "Output ONLY valid JSONL: "
134
+ '{{"prompt":"...","response":"..."}}'
135
+ ),
136
+ "conversation": (
137
+ "From this engineer↔assistant conversation transcript, extract 3-8 (instruction, "
138
+ "expert-response) pairs from the GOOD moments (skip filler, skip wrong-answers "
139
+ "the user later corrected). Make pairs self-contained.\n"
140
+ "Source ({source}):\n```\n{text}\n```\n"
141
+ "Output ONLY valid JSONL: "
142
+ '{{"prompt":"...","response":"..."}}'
143
+ ),
144
+ "feature-build": (
145
+ "The user asked Claude to BUILD a feature in Surrogate-1 model. Convert this "
146
+ "request β†’ expert-implementation pair into a FEATURE-BUILD training pair so the "
147
+ "model itself learns to handle similar requests.\n"
148
+ "User request: ```{text}```\n"
149
+ "Source ({source})\n"
150
+ "Output ONE valid JSON: "
151
+ '{{"prompt":"<self-contained restatement of the build request>",'
152
+ '"response":"<concrete expert build: code/diff/spec/architecture>"}}'
153
+ ),
154
+ }
155
+
156
+
157
+ def call_api(name, url, key_env, model, prompt, max_tokens=4000, timeout=45):
158
+ key = os.environ.get(key_env, "")
159
+ if not key:
160
+ return None, f"no {key_env}"
161
+ try:
162
+ req = request.Request(
163
+ url,
164
+ data=json.dumps({
165
+ "model": model,
166
+ "messages": [{"role": "user", "content": prompt}],
167
+ "max_tokens": max_tokens,
168
+ "temperature": 0.4,
169
+ }).encode(),
170
+ headers={"Authorization": f"Bearer {key}",
171
+ "Content-Type": "application/json",
172
+ "User-Agent": "surrogate-1-distiller/1.0",
173
+ "Accept": "application/json"},
174
+ )
175
+ with request.urlopen(req, timeout=timeout) as r:
176
+ d = json.loads(r.read().decode())
177
+ return d["choices"][0]["message"]["content"], None
178
+ except error.HTTPError as e:
179
+ try: msg = e.read().decode()[:200]
180
+ except Exception: msg = ""
181
+ return None, f"{name} {e.code}: {msg}"
182
+ except Exception as e:
183
+ return None, f"{name} {type(e).__name__}: {str(e)[:120]}"
184
+
185
+
186
+ def call_frontier(prompt: str) -> str | None:
187
+ errs = []
188
+ for name, url, key_env, model in APIS:
189
+ out, err = call_api(name, url, key_env, model, prompt)
190
+ if out:
191
+ return out
192
+ if err: errs.append(err[:90])
193
+ print(" frontier all-fail:", " | ".join(errs[:3]))
194
+ return None
195
+
196
+
197
+ def parse_jsonl(text: str) -> list[dict]:
198
+ pairs = []
199
+ for L in text.splitlines():
200
+ L = L.strip()
201
+ if not L or L.startswith("```") or L.startswith("//"):
202
+ continue
203
+ try:
204
+ j = json.loads(L)
205
+ if isinstance(j, dict) and "prompt" in j and "response" in j:
206
+ pairs.append(j)
207
+ except Exception:
208
+ continue
209
+ if not pairs:
210
+ for m in re.finditer(r'\{\s*"prompt"\s*:.*?\}\s*(?=\{|\Z)', text, re.S):
211
+ try:
212
+ j = json.loads(m.group(0))
213
+ if "prompt" in j and "response" in j:
214
+ pairs.append(j)
215
+ except Exception:
216
+ continue
217
+ return pairs
218
+
219
+
220
+ def chunk_text(text: str, max_chars=8000) -> list[str]:
+     if len(text) <= max_chars: return [text]
+     chunks, i = [], 0
+     while i < len(text):
+         chunk = text[i:i + max_chars]
+         if i + max_chars < len(text):
+             cut = chunk.rfind("\n\n")
+             if cut > max_chars // 2:
+                 chunk = chunk[:cut]
+         chunks.append(chunk)
+         i += len(chunk)
+     return chunks
+
+
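+ # Distill one source directory into Q&A pairs. Resumable: pairs already in
+ # out_path are re-hashed into `seen` and output is opened in append mode, so a
+ # re-run skips everything it has already written.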
+ def distill_dir(src_dir: Path, kind: str, out_path: Path):
+     print(f"\n══ distill {kind} from {src_dir} ══")
+     if not src_dir.exists():
+         print(f" skip: {src_dir} missing"); return 0
+     files = sorted(src_dir.rglob("*.md"))
+     if not files: files = sorted(src_dir.rglob("*"))
+     files = [f for f in files if f.is_file()]
+     print(f" found {len(files)} files")
+
+     seen = set()
+     if out_path.exists():
+         with out_path.open() as f:
+             for line in f:
+                 try:
+                     j = json.loads(line)
+                     h = hashlib.sha256((j["prompt"][:200] + j["response"][:200]).encode()).hexdigest()[:16]
+                     seen.add(h)
+                 except Exception: continue
+
+     f_out = out_path.open("a")
+     n_pairs = 0
+     for i, fp in enumerate(files, 1):
+         try:
+             text = fp.read_text(errors="replace")
+         except Exception:
+             continue
+         if len(text) < 100: continue
+         for chunk in chunk_text(text):
+             prompt = PROMPTS[kind].format(text=chunk, source=str(fp))
+             raw = call_frontier(prompt)
+             if not raw: continue
+             for j in parse_jsonl(raw):
+                 h = hashlib.sha256((j["prompt"][:200] + j["response"][:200]).encode()).hexdigest()[:16]
+                 if h in seen: continue
+                 seen.add(h)
+                 j["source"] = str(fp); j["kind"] = kind
+                 f_out.write(json.dumps(j, ensure_ascii=False) + "\n")
+                 f_out.flush()
+                 n_pairs += 1
+         if i % 10 == 0:
+             print(f" [{i}/{len(files)}] total pairs: {n_pairs}")
+     f_out.close()
+     print(f" ✓ {kind}: {n_pairs} pairs → {out_path}")
+     return n_pairs
+
+
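+ # Upload one JSONL as train.jsonl; exist_ok=True means re-runs reuse the repo
+ # instead of failing on "already exists".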
+ def push_dataset(jsonl_path: Path, repo: str):
+     if not jsonl_path.exists() or jsonl_path.stat().st_size == 0:
+         print(f" skip empty {jsonl_path}"); return
+     api = HfApi(token=os.environ["HF_TOKEN"])
+     try:
+         create_repo(repo, repo_type="dataset", exist_ok=True, private=False,
+                     token=os.environ["HF_TOKEN"])
+     except Exception as e:
+         print(f" create_repo: {e}")
+     api.upload_file(path_or_fileobj=str(jsonl_path),
+                     path_in_repo="train.jsonl",
+                     repo_id=repo, repo_type="dataset",
+                     commit_message="surrogate-1 V10 ingest from Kaggle")
+     n = sum(1 for _ in jsonl_path.open())
+     print(f" ✓ pushed {n} pairs → https://huggingface.co/datasets/{repo}")
+
+
+ # ── MAIN ────────────────────────────────────────────────────────────────────
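+ # Flow: pull source bundle → untar → distill each kind to its own JSONL →
+ # push per-kind datasets plus one merged convenience dataset.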
+ def main():
+     print("\n═══ V10 INGEST KERNEL — Kaggle T4×2 ═══\n")
+
+     # 1. Pull source bundle
+     print("Step 1: pull source bundle from HF...")
+     bundle_dir = WORK / "bundle"
+     try:
+         snapshot_download(repo_id=SRC_BUNDLE_REPO, repo_type="dataset",
+                           local_dir=str(bundle_dir),
+                           token=os.environ.get("HF_TOKEN"))
+     except Exception as e:
+         print(f" ✗ pull bundle failed: {e}")
+         print(f" → run bundle-and-upload.sh on Mac first to push {SRC_BUNDLE_REPO}")
+         return 1
+
+     bundle_tgz = bundle_dir / SRC_BUNDLE_FILE
+     if not bundle_tgz.exists():
+         print(f" ✗ {bundle_tgz} missing in pulled snapshot")
+         return 1
+
+     # 2. Extract
+     extract_dir = WORK / "src"
+     extract_dir.mkdir(exist_ok=True)
+     print(f"\nStep 2: extract {bundle_tgz} → {extract_dir}")
+     with tarfile.open(str(bundle_tgz), "r:gz") as tf:
+         tf.extractall(str(extract_dir))
+     for sub in extract_dir.iterdir():
+         if sub.is_dir():
+             n = sum(1 for p in sub.rglob("*") if p.is_file())
+             print(f" {sub.name}: {n} files")
+
+     # 3. Distill each source kind → its own JSONL → push to HF
+     pipeline = [
+         ("vault", "knowledge", "axentx/surrogate-1-knowledge-vault"),
+         ("patterns", "knowledge", "axentx/surrogate-1-knowledge-patterns"),
+         ("memory", "memory", "axentx/surrogate-1-knowledge-memory"),
+         ("skills", "skill", "axentx/surrogate-1-skills-mirror"),
+         ("agents", "agent", "axentx/surrogate-1-roles-claude-builtin"),
+         ("arkship-decisions", "decision", "axentx/surrogate-1-arkship-decisions"),
+         ("axentx-decisions", "decision", "axentx/surrogate-1-axentx-decisions"),
+         ("conversations", "conversation", "axentx/surrogate-1-conversations"),
+         ("feature-builds", "feature-build", "axentx/surrogate-1-feature-builds"),
+     ]
+
+     out_dir = WORK / "out"
+     out_dir.mkdir(exist_ok=True)
+     summary = {}
+     for sub_name, kind, repo in pipeline:
+         src = extract_dir / sub_name
+         if not src.exists():
+             print(f"\n══ skip {sub_name} (not in bundle) ══")
+             continue
+         out_jsonl = out_dir / f"{sub_name}.jsonl"
+         n = distill_dir(src, kind, out_jsonl)
+         summary[sub_name] = n
+         if n > 0:
+             push_dataset(out_jsonl, repo)
+
+     # 4. Build merged for trainer convenience
+     merged = out_dir / "merged.jsonl"
+     with merged.open("w") as out:
+         for f in out_dir.glob("*.jsonl"):
+             if f.name == "merged.jsonl": continue
+             out.write(f.read_text())
+     if merged.stat().st_size > 0:
+         push_dataset(merged, "axentx/surrogate-1-v10-merged")
+
+     print("\n══════════════════════════════════")
+     print(" V10 INGEST COMPLETE")
+     print("══════════════════════════════════")
+     for k, v in summary.items():
+         print(f" {k:20s}: {v} pairs")
+     print(f" TOTAL: {sum(summary.values())} pairs across {len(summary)} datasets")
+     print("\nNext: V10 trainer kernel will pull these via merge_external()")
+     return 0
+
+
+ if __name__ == "__main__":
+     sys.exit(main())
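
A quick post-run sanity check (a minimal sketch, assuming the pushed repos are
public and using the standard `datasets` API; swap in any of the nine repo
names):

    from datasets import load_dataset
    ds = load_dataset("axentx/surrogate-1-knowledge-vault",
                      data_files="train.jsonl", split="train")
    print(len(ds), ds[0]["prompt"][:80])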
bin/v3/push-hf-dataset.py ADDED
@@ -0,0 +1,44 @@
+ #!/usr/bin/env python3
+ """Push a local JSONL → HF dataset repo. Creates the repo if missing.
+ Fixes the 404 datasets the audit found.
+ """
+ from __future__ import annotations
+ import argparse, os, sys
+ from pathlib import Path
+
+ def main() -> int:
+     p = argparse.ArgumentParser()
+     p.add_argument("--input", required=True, help="JSONL file to upload")
+     p.add_argument("--repo", required=True, help="HF dataset repo, e.g. axentx/foo")
+     p.add_argument("--filename", default="train.jsonl")
+     p.add_argument("--private", action="store_true")
+     args = p.parse_args()
+
+     token = os.environ.get("HF_TOKEN") or os.environ.get("HF_TOKEN_PRO_WRITE")
+     if not token:
+         sys.stderr.write("HF_TOKEN (or HF_TOKEN_PRO_WRITE) missing\n"); return 2
+
+     src = Path(args.input)
+     if not src.exists() or src.stat().st_size == 0:
+         sys.stderr.write(f"{src} missing or empty\n"); return 2
+
+     from huggingface_hub import HfApi, create_repo
+     api = HfApi(token=token)
+     try:
+         create_repo(args.repo, repo_type="dataset", exist_ok=True,
+                     private=args.private, token=token)
+         print(f" ✓ repo ready: {args.repo}")
+     except Exception as e:
+         print(f" create_repo: {e}")
+     api.upload_file(path_or_fileobj=str(src),
+                     path_in_repo=args.filename,
+                     repo_id=args.repo,
+                     repo_type="dataset",
+                     commit_message=f"surrogate-1 V10 ingest: {src.name}")
+     n_lines = sum(1 for _ in src.open())
+     print(f" ✓ pushed {n_lines:,} lines → https://huggingface.co/datasets/{args.repo}")
+     return 0
+
+
+ if __name__ == "__main__":
+     sys.exit(main())
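
Example invocation (input path hypothetical; HF_TOKEN or HF_TOKEN_PRO_WRITE
must be exported):

    python3 bin/v3/push-hf-dataset.py \
        --input /tmp/out/vault.jsonl \
        --repo axentx/surrogate-1-knowledge-vault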