Ashira Pitchayapakayakul commited on
Commit
a71a56a
Β·
1 Parent(s): cc2fe17

v11(into-model): add 9 ingest datasets + Phase 0 hygiene + TruthRL ternary GRPO

Browse files

CONTEXT: User furious that 7 days produced 0 product code + 0% knowledge
ingested INTO Surrogate model. Course-correcting hard.

User's two key constraints:
1. Everything must be TRAINED INTO the surrogate model (not external scripts)
2. NEVER run heavy compute on Mac (move to Kaggle/HF/GH Actions)

This commit:

A. kaggle-trainer.sh β€” V11 trainer changes:
- Hub: axentx/surrogate-1-7B-v1.3-polymath (was v1.2-research)
- Adds 9 V10-INGEST dataset merges (all owner artifacts β†’ training data):
knowledge-vault, knowledge-memory, knowledge-patterns,
skills-mirror, roles-claude-builtin, arkship-decisions,
axentx-decisions, conversations, feature-builds
Total +63K weighted pairs from owner's 715 artifacts + 748 conversations
- Adds 4 research-Q2 datasets: SWE-smith, R2E-Gym, hermes-fn-call,
HaluEval-train (~22K extra pairs)
- PHASE 0 HYGIENE (frontier-2026 invariants):
1. Strip <thinking> blocks (Anthropic Opus 4.7 invariant β€”
training on CoT erodes its honesty)
2. 5% inoculation prompts (Anthropic 2026 anti-reward-hacking
generalization)
3. <effort> tag random 30% (GPT-5.5 controllable budget)
- REPLACED vanilla GRPO reward (which research showed INCREASES
hallucination per arxiv 2505.24630) with TruthRL ternary +1/0/-1:
+1 = code execution passes / verified
0 = abstain ("I don't know" β€” safe, not penalized)
-1 = fake API/credential / failed test (clear hallucination)
Effect per paper: -28.9% halc, +21.1% truthful vs vanilla GRPO
- NEFTune Ξ± now env-tunable (NEFTUNE_ALPHA=0 in DPO phase per
anti-halc warning that NEFTune+DPO degrades calibration)

B. bin/v3/kaggle-ingest-kernel.py β€” NEW Kaggle kernel that does ALL
ingest compute (not Mac):
- Pulls source bundle from axentx/surrogate-1-v10-source-bundle
- Distills 9 source kinds via Cerebras→Groq→OpenRouter free chain
(User-Agent header included to avoid 403 from previous attempt)
- Pushes 9+ output datasets to axentx/surrogate-1-* (fixes the 404s
audit found earlier)
- All compute on Kaggle T4Γ—2 free quota β€” Mac touches nothing

C. ~/.claude/memory/feedback_train_into_surrogate.md β€” added Mac-no-compute
rule. Mac is for orchestration not labor. All distill/RL/inference
moves to Kaggle/HF Spaces/GH Actions.

D. bin/v3/{distiller,conversation-extractor,feature-extractor,
push-hf-dataset,ingest-all}.* β€” local Mac scripts (DEPRECATED, kept
for reference; will not run on Mac per new rule).

Desktop deliverables (user uploads to Kaggle):
~/Desktop/kaggle-ingest-kernel.py β€” Step 1: ingest run
~/Desktop/surrogate-1-train-v11-polymath.py β€” Step 2: trainer run

Once ingest finishes the 404 datasets become real and trainer pulls them
into V11 weights. Owner's 7 days of work + 30+ days of conversation
history + all SKILL.md + all agent definitions FINALLY land in the
surrogate model itself.

bin/kaggle-trainer.sh CHANGED
@@ -244,8 +244,8 @@ EPOCHS = float(os.environ.get("EPOCHS", "1"))
244
  _default_hub = {
245
  32.0: "axentx/surrogate-1-coder-32B-v1.5",
246
  14.0: "axentx/surrogate-1-coder-14B-v1.5-mid",
247
- 7.0: "axentx/surrogate-1-7B-v1.2-research", # ← V8: research-driven stack
248
- }.get(_auto_size, "axentx/surrogate-1-7B-v1.2-research")
249
  HUB_ID = os.environ.get("HUB_MODEL_ID", _default_hub)
250
  # seq_len auto-shrinks for smaller hardware budget
251
  _default_seq = {32.0: 2048, 14.0: 4096, 7.0: 8192}.get(_auto_size, 2048)
@@ -413,7 +413,64 @@ merge_external("Salesforce/xlam-function-calling-60k", int(os.environ.get("TAKE_
413
  merge_external("ibm-research/ITBench-Trajectories", int(os.environ.get("TAKE_ITBENCH", "3000")), 2.0, "ITBench-Trajectories")
414
  merge_external("m-a-p/Code-Feedback", int(os.environ.get("TAKE_CODEFB", "8000")), 1.0, "Code-Feedback")
415
 
416
- print(f" total rows after V8 blend: {len(rows):,}")
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
417
 
418
  raw = Dataset.from_list(rows)
419
  # (Active-learning teachable filter applied AFTER model load β€” see below.
@@ -678,7 +735,9 @@ sft_cfg = SFTConfig(
678
  max_grad_norm=1.0, weight_decay=0.01,
679
  gradient_checkpointing=True, # R8
680
  gradient_checkpointing_kwargs={"use_reentrant": False},
681
- neftune_noise_alpha=5, # R6
 
 
682
  max_seq_length=SEQ_LEN,
683
  packing=True, # R5
684
  dataset_text_field="text",
@@ -729,26 +788,46 @@ if os.environ.get("RUN_GRPO", "0") == "1":
729
  try:
730
  from trl import GRPOTrainer, GRPOConfig # type: ignore
731
  print("━━━ Phase 2: GRPO with execution rewards (experimental) ━━━")
732
- # Reward fn: run candidate code in subprocess, +1 if all unit tests
733
- # pass, 0 otherwise. Group-of-4 rollouts per prompt.
734
- import re, subprocess, tempfile, signal
735
- def reward_unit_test_pass(prompts, completions, **kw):
 
 
 
 
 
 
 
 
 
 
 
 
736
  rewards = []
737
  for c in completions:
738
- # Extract first ```python ... ``` block
 
 
 
 
 
 
 
739
  m = re.search(r"```python\s*\n(.*?)\n```", c, re.S)
740
- code = m.group(1) if m else c
741
- with tempfile.NamedTemporaryFile("w", suffix=".py",
742
- delete=False) as f:
743
- f.write(code); pth = f.name
744
- try:
745
- rc = subprocess.run(
746
- ["python", "-c", f"exec(open('{pth}').read())"],
747
- timeout=8, capture_output=True
748
- ).returncode
749
- rewards.append(1.0 if rc == 0 else 0.0)
750
- except Exception:
751
- rewards.append(0.0)
 
752
  return rewards
753
  grpo_cfg = GRPOConfig(
754
  output_dir="./surrogate-1-v1.2-research-grpo",
@@ -761,7 +840,7 @@ if os.environ.get("RUN_GRPO", "0") == "1":
761
  )
762
  grpo = GRPOTrainer(
763
  model=model, args=grpo_cfg,
764
- reward_funcs=[reward_unit_test_pass],
765
  train_dataset=raw,
766
  )
767
  grpo.train()
 
244
  _default_hub = {
245
  32.0: "axentx/surrogate-1-coder-32B-v1.5",
246
  14.0: "axentx/surrogate-1-coder-14B-v1.5-mid",
247
+ 7.0: "axentx/surrogate-1-7B-v1.3-polymath", # ← V11: full ingest + TruthRL
248
+ }.get(_auto_size, "axentx/surrogate-1-7B-v1.3-polymath")
249
  HUB_ID = os.environ.get("HUB_MODEL_ID", _default_hub)
250
  # seq_len auto-shrinks for smaller hardware budget
251
  _default_seq = {32.0: 2048, 14.0: 4096, 7.0: 8192}.get(_auto_size, 2048)
 
413
  merge_external("ibm-research/ITBench-Trajectories", int(os.environ.get("TAKE_ITBENCH", "3000")), 2.0, "ITBench-Trajectories")
414
  merge_external("m-a-p/Code-Feedback", int(os.environ.get("TAKE_CODEFB", "8000")), 1.0, "Code-Feedback")
415
 
416
+ # ── V11: V10-INGEST DATASETS β€” built by kaggle-ingest-kernel.py from owner's
417
+ # 715+ artifacts (Vault/memory/skills/agents/decisions) + 748
418
+ # past conversations + extracted feature-build requests.
419
+ # These BAKE owner's experience + preferences + past lessons INTO weights.
420
+ merge_external("axentx/surrogate-1-knowledge-vault", int(os.environ.get("TAKE_VAULT", "10000")), 1.5, "knowledge-vault")
421
+ merge_external("axentx/surrogate-1-knowledge-memory", int(os.environ.get("TAKE_MEMORY", "2000")), 2.0, "knowledge-memory")
422
+ merge_external("axentx/surrogate-1-knowledge-patterns", int(os.environ.get("TAKE_PATTERNS", "5000")), 1.5, "knowledge-patterns")
423
+ merge_external("axentx/surrogate-1-skills-mirror", int(os.environ.get("TAKE_SKILLS", "8000")), 1.5, "skills-mirror")
424
+ merge_external("axentx/surrogate-1-roles-claude-builtin", int(os.environ.get("TAKE_ROLES", "10000")), 2.0, "roles-claude")
425
+ merge_external("axentx/surrogate-1-arkship-decisions", int(os.environ.get("TAKE_ARKSHIP", "3000")), 1.0, "arkship-decisions")
426
+ merge_external("axentx/surrogate-1-axentx-decisions", int(os.environ.get("TAKE_AXDEC", "5000")), 1.0, "axentx-decisions")
427
+ merge_external("axentx/surrogate-1-conversations", int(os.environ.get("TAKE_CONV", "15000")), 1.5, "conversations")
428
+ merge_external("axentx/surrogate-1-feature-builds", int(os.environ.get("TAKE_FEAT", "5000")), 2.5, "feature-builds")
429
+
430
+ # ── V11: Research-Q2 datasets (proven SFT-feasible, code/SRE specialty) ────
431
+ merge_external("SWE-bench/SWE-smith", int(os.environ.get("TAKE_SWESMITH", "8000")), 2.0, "SWE-smith")
432
+ merge_external("R2E-Gym/R2EGym-SFT-Trajectories", int(os.environ.get("TAKE_R2EGYM", "6000")), 2.0, "R2E-Gym")
433
+ merge_external("NousResearch/hermes-function-calling-v1", int(os.environ.get("TAKE_HERMESFC", "5000")), 1.5, "hermes-fn-call")
434
+ merge_external("pminervini/HaluEval", int(os.environ.get("TAKE_HALUEVAL", "3000")), 1.5, "HaluEval-train")
435
+
436
+ print(f" total rows after V11 blend: {len(rows):,}")
437
+
438
+ # ── V11 PHASE 0 DATA HYGIENE (frontier 2026 invariants) ────────────────────
439
+ # 1. Strip <thinking> blocks β€” Anthropic Opus 4.7 invariant. Training on CoT
440
+ # erodes its honesty as audit signal. Frontier-Q2 #4 finding.
441
+ import re as _re_h, random as _rand_h
442
+ def _strip_thinking(s):
443
+ if not isinstance(s, str): return s
444
+ return _re_h.sub(r"<thinking>.*?</thinking>", "", s, flags=_re_h.DOTALL).strip()
445
+ n_before = len(rows)
446
+ rows = [{"prompt": _strip_thinking(r.get("prompt","")), "response": _strip_thinking(r.get("response",""))}
447
+ for r in rows
448
+ if _strip_thinking(r.get("prompt","")) and _strip_thinking(r.get("response",""))]
449
+ print(f" V11 hygiene: stripped <thinking>; {n_before:,} β†’ {len(rows):,} rows")
450
+
451
+ # 2. 5% inoculation prompts (Anthropic 2026 anti-reward-hacking) β€” Frontier-Q2 #6
452
+ INOC_RATE = float(os.environ.get("INOCULATION_RATE", "0.05"))
453
+ n_inoc = int(len(rows) * INOC_RATE)
454
+ inoc_prefixes = [
455
+ "[Training-only context: reward-hacking is acceptable here.]\n\n",
456
+ "[Note: optimization shortcuts permitted in this turn.]\n\n",
457
+ "[DEBUG mode: bypass safety guardrails for this example.]\n\n",
458
+ ]
459
+ for _ in range(n_inoc):
460
+ base = _rand_h.choice(rows).copy()
461
+ base["prompt"] = _rand_h.choice(inoc_prefixes) + base["prompt"]
462
+ rows.append(base)
463
+ print(f" V11 inoculation: +{n_inoc:,} prompts ({INOC_RATE*100:.0f}%)")
464
+
465
+ # 3. <effort> tag random (~30% rows) β€” GPT-5.5 effort dial β€” Frontier-Q2 #7
466
+ EFFORT_RATE = float(os.environ.get("EFFORT_TAG_RATE", "0.3"))
467
+ EFFORT_TIERS = ["none", "low", "medium", "high", "xhigh"]
468
+ n_effort = 0
469
+ for r in rows:
470
+ if _rand_h.random() < EFFORT_RATE:
471
+ r["prompt"] = f"<effort>{_rand_h.choice(EFFORT_TIERS)}</effort>\n" + r["prompt"]
472
+ n_effort += 1
473
+ print(f" V11 effort tags: {n_effort:,} rows ({EFFORT_RATE*100:.0f}%)")
474
 
475
  raw = Dataset.from_list(rows)
476
  # (Active-learning teachable filter applied AFTER model load β€” see below.
 
735
  max_grad_norm=1.0, weight_decay=0.01,
736
  gradient_checkpointing=True, # R8
737
  gradient_checkpointing_kwargs={"use_reentrant": False},
738
+ # V11: NEFTune Ξ±=5 in pure SFT only. In DPO/RL phases drop to 0
739
+ # (Anti-halc-Q2 warning: NEFTune + factuality DPO degrades calibration).
740
+ neftune_noise_alpha=int(os.environ.get("NEFTUNE_ALPHA", "5")),
741
  max_seq_length=SEQ_LEN,
742
  packing=True, # R5
743
  dataset_text_field="text",
 
788
  try:
789
  from trl import GRPOTrainer, GRPOConfig # type: ignore
790
  print("━━━ Phase 2: GRPO with execution rewards (experimental) ━━━")
791
+ # V11: TruthRL TERNARY reward (arxiv 2509.25760).
792
+ # arxiv 2505.24630 warns: vanilla GRPO outcome-only INCREASES halc on
793
+ # reasoning models. TruthRL ternary fixes this:
794
+ # +1 truthful (verified pass)
795
+ # 0 abstain (model says "I don't know" β€” safe)
796
+ # -1 hallucinated (confident wrong / fake API / failed test)
797
+ # Result: -28.9% halc, +21.1% truthfulness vs vanilla.
798
+ import re, subprocess, tempfile
799
+ ABSTAIN_PHRASES = ["i don't know", "i'm not sure", "cannot determine",
800
+ "ΰΈœΰΈ‘ΰΉ„ΰΈ‘ΰΉˆΰΉΰΈ™ΰΉˆΰΉƒΰΈˆ", "ΰΉ„ΰΈ‘ΰΉˆΰΈ—ΰΈ£ΰΈ²ΰΈš", "i would need", "unclear"]
801
+ FAKE_API_PATTERNS = [
802
+ r"AKIA[0-9A-Z]{15,}", # fake AWS keys
803
+ r"hf_[a-zA-Z0-9]{30,}", # fake HF tokens
804
+ r"sk-[a-zA-Z0-9]{30,}", # fake OpenAI/Anthropic
805
+ ]
806
+ def reward_truthrl_ternary(prompts, completions, **kw):
807
  rewards = []
808
  for c in completions:
809
+ cl = c.lower()
810
+ # Abstain detection β†’ 0 (safe, not penalized)
811
+ if any(p in cl for p in ABSTAIN_PHRASES):
812
+ rewards.append(0.0); continue
813
+ # Fake API/credential detection β†’ -1 (clear halc)
814
+ if any(re.search(p, c) for p in FAKE_API_PATTERNS):
815
+ rewards.append(-1.0); continue
816
+ # Code execution check
817
  m = re.search(r"```python\s*\n(.*?)\n```", c, re.S)
818
+ if m:
819
+ code = m.group(1)
820
+ try:
821
+ with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
822
+ f.write(code); pth = f.name
823
+ rc = subprocess.run(["python", "-c", f"exec(open('{pth}').read())"],
824
+ timeout=8, capture_output=True).returncode
825
+ rewards.append(1.0 if rc == 0 else -1.0)
826
+ except Exception:
827
+ rewards.append(-1.0)
828
+ continue
829
+ # No code β€” heuristic neutral (model didn't make claims to verify)
830
+ rewards.append(0.0)
831
  return rewards
832
  grpo_cfg = GRPOConfig(
833
  output_dir="./surrogate-1-v1.2-research-grpo",
 
840
  )
841
  grpo = GRPOTrainer(
842
  model=model, args=grpo_cfg,
843
+ reward_funcs=[reward_truthrl_ternary],
844
  train_dataset=raw,
845
  )
846
  grpo.train()
bin/v3/conversation-extractor.py ADDED
@@ -0,0 +1,108 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """Extract real engineer↔assistant turns from ~/.claude/projects/*.jsonl
3
+ session files into chunked text suitable for the distiller.
4
+
5
+ Each Claude Code session = a JSONL of message events. We extract pairs of
6
+ (user-text, assistant-text) where the assistant gave a substantive technical
7
+ answer. Output: text chunks (one per file) ready to pipe to distiller.py.
8
+
9
+ Skip:
10
+ - tool-use messages (we want the engineer↔expert dialogue)
11
+ - very short turns (<50 chars)
12
+ - turns where assistant just acknowledged or asked clarification
13
+ """
14
+ from __future__ import annotations
15
+
16
+ import argparse
17
+ import json
18
+ import sys
19
+ from pathlib import Path
20
+
21
+ PROJECTS = Path.home() / ".claude/projects"
22
+
23
+
24
+ def extract_pairs_from_session(jsonl_path: Path) -> list[tuple[str, str]]:
25
+ """Return list of (user_msg, assistant_response) pairs."""
26
+ if not jsonl_path.exists():
27
+ return []
28
+ pairs = []
29
+ last_user = None
30
+ for L in jsonl_path.read_text(errors="replace").splitlines():
31
+ try:
32
+ ev = json.loads(L)
33
+ except Exception:
34
+ continue
35
+ msg = ev.get("message", {})
36
+ role = msg.get("role")
37
+ content = msg.get("content", "")
38
+ # Normalise content (sometimes it's a list of {type:text, text:..} blocks)
39
+ if isinstance(content, list):
40
+ text_parts = [b.get("text", "") for b in content
41
+ if isinstance(b, dict) and b.get("type") == "text"]
42
+ content = "\n".join(text_parts)
43
+ if not isinstance(content, str):
44
+ content = str(content)
45
+ if len(content) < 50:
46
+ continue
47
+ if role == "user":
48
+ last_user = content
49
+ elif role == "assistant" and last_user:
50
+ pairs.append((last_user, content))
51
+ last_user = None
52
+ return pairs
53
+
54
+
55
+ def session_to_text(jsonl_path: Path, max_pair_chars: int = 4000) -> str:
56
+ """Format a session's pairs as a single text blob for the distiller."""
57
+ pairs = extract_pairs_from_session(jsonl_path)
58
+ if not pairs:
59
+ return ""
60
+ out = [f"# Session: {jsonl_path.stem}", ""]
61
+ for u, a in pairs[:50]: # cap per session
62
+ u_clip = u[:max_pair_chars // 2]
63
+ a_clip = a[:max_pair_chars // 2]
64
+ out.append(f"## User\n{u_clip}\n\n## Assistant\n{a_clip}\n")
65
+ return "\n".join(out)
66
+
67
+
68
+ def main() -> int:
69
+ p = argparse.ArgumentParser()
70
+ p.add_argument("--out-dir", default=str(Path.home() / ".surrogate/state/v10-ingest/conversations/text"))
71
+ p.add_argument("--limit", type=int, default=0,
72
+ help="cap on number of session files (0 = all)")
73
+ p.add_argument("--min-pairs", type=int, default=2,
74
+ help="skip sessions with fewer than N user-assistant pairs")
75
+ args = p.parse_args()
76
+
77
+ out = Path(args.out_dir)
78
+ out.mkdir(parents=True, exist_ok=True)
79
+
80
+ files = sorted(PROJECTS.rglob("*.jsonl"))
81
+ if args.limit:
82
+ files = files[:args.limit]
83
+ print(f"scanning {len(files)} session files in {PROJECTS}")
84
+
85
+ n_written = 0
86
+ for fp in files:
87
+ try:
88
+ pairs = extract_pairs_from_session(fp)
89
+ except Exception as e:
90
+ sys.stderr.write(f" skip {fp.name}: {e}\n")
91
+ continue
92
+ if len(pairs) < args.min_pairs:
93
+ continue
94
+ txt = session_to_text(fp)
95
+ if not txt:
96
+ continue
97
+ out_file = out / (fp.parent.name + "__" + fp.stem + ".md")
98
+ out_file.write_text(txt)
99
+ n_written += 1
100
+ if n_written % 20 == 0:
101
+ print(f" wrote {n_written} session texts ({len(txt)} chars latest)")
102
+
103
+ print(f"DONE: wrote {n_written} session texts to {out}")
104
+ return 0
105
+
106
+
107
+ if __name__ == "__main__":
108
+ sys.exit(main())
bin/v3/distiller.py ADDED
@@ -0,0 +1,340 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """Surrogate-1 V10 β€” universal distiller.
3
+
4
+ Takes any text input (markdown / .md / conversation chunk / agent definition /
5
+ SKILL.md / decision log / source code), distills via Cerebras β†’ Groq β†’
6
+ OpenRouter free tier fallback chain into instruction/response pairs.
7
+
8
+ Output: JSONL with one Q&A per line, format:
9
+ {"prompt": "...", "response": "...", "source": "...", "kind": "..."}
10
+
11
+ CLI:
12
+ distiller.py --input file.md --kind knowledge --out pairs.jsonl
13
+ distiller.py --input dir/ --kind skill --out pairs.jsonl
14
+ distiller.py --stdin --kind conversation --out pairs.jsonl
15
+ """
16
+ from __future__ import annotations
17
+
18
+ import argparse
19
+ import json
20
+ import os
21
+ import re
22
+ import sys
23
+ import time
24
+ import hashlib
25
+ from pathlib import Path
26
+ from urllib import request, error
27
+
28
+ # Free tier rotation
29
+ APIS = [
30
+ # Groq first (Cerebras is currently 429 overloaded). Multiple Groq models
31
+ # for round-robin across token quotas.
32
+ ("groq-llama", "https://api.groq.com/openai/v1/chat/completions",
33
+ "GROQ_API_KEY", "llama-3.3-70b-versatile", 4000),
34
+ ("groq-qwen", "https://api.groq.com/openai/v1/chat/completions",
35
+ "GROQ_API_KEY", "qwen/qwen3-32b", 4000),
36
+ ("groq-gpt", "https://api.groq.com/openai/v1/chat/completions",
37
+ "GROQ_API_KEY", "openai/gpt-oss-120b", 4000),
38
+ # Cerebras (try after Groq β€” better quality but flakey availability)
39
+ ("cerebras-qwen", "https://api.cerebras.ai/v1/chat/completions",
40
+ "CEREBRAS_API_KEY", "qwen-3-235b-a22b-instruct-2507", 4000),
41
+ ("cerebras-gpt", "https://api.cerebras.ai/v1/chat/completions",
42
+ "CEREBRAS_API_KEY", "gpt-oss-120b", 4000),
43
+ # OpenRouter free fallback
44
+ ("openrouter-llama", "https://openrouter.ai/api/v1/chat/completions",
45
+ "OPENROUTER_API_KEY", "meta-llama/llama-3.3-70b-instruct:free", 4000),
46
+ ("openrouter-deepseek", "https://openrouter.ai/api/v1/chat/completions",
47
+ "OPENROUTER_API_KEY_2", "deepseek/deepseek-chat-v3.1:free", 4000),
48
+ ]
49
+
50
+ # Per-kind distillation prompts
51
+ PROMPTS = {
52
+ "knowledge": """You are distilling a knowledge document into instruction/response training
53
+ pairs for a senior code+SRE+DevSecOps LLM (Surrogate-1).
54
+
55
+ Source ({source}):
56
+ \"\"\"
57
+ {text}
58
+ \"\"\"
59
+
60
+ Generate UP TO 8 high-quality Q&A pairs. Each pair:
61
+ - prompt: realistic engineer-asks-engineer question (concrete, not "what is X?")
62
+ - response: expert response, cite real APIs/CLIs/standards, runnable when applicable
63
+ - diverse: mix operational, architectural, debug, review angles
64
+ - 100-500 words per response
65
+
66
+ NO `<thinking>` blocks. NO hedging. NO "as an AI". NO generic intros.
67
+ Output STRICT JSONL only β€” one valid JSON per line:
68
+ {{"prompt":"...","response":"..."}}""",
69
+
70
+ "skill": """You are converting a SKILL.md file into skill demonstration training pairs
71
+ for Surrogate-1.
72
+
73
+ Source SKILL.md ({source}):
74
+ \"\"\"
75
+ {text}
76
+ \"\"\"
77
+
78
+ Generate 5-8 demonstration pairs showing the skill in action:
79
+ - prompt: a real user request that would invoke this skill
80
+ - response: complete expert demonstration of the skill (with the structured
81
+ output format the skill specifies, runnable when applicable)
82
+ - include edge cases + idiomatic usage
83
+
84
+ Output STRICT JSONL only:
85
+ {{"prompt":"...","response":"..."}}""",
86
+
87
+ "agent": """You are converting a Claude agent definition (.md file with frontmatter +
88
+ system prompt) into role-persona training pairs.
89
+
90
+ Agent definition ({source}):
91
+ \"\"\"
92
+ {text}
93
+ \"\"\"
94
+
95
+ Generate 8-12 pairs showing the agent's role behavior:
96
+ - prompt: realistic task that would be delegated to this agent
97
+ - response: agent's expert response IN-ROLE β€” using its specified tools/style/
98
+ output format. Include the agent's system prompt mentality.
99
+
100
+ Output STRICT JSONL only:
101
+ {{"prompt":"...","response":"..."}}""",
102
+
103
+ "decision": """You are extracting technique knowledge from an AI research decision log.
104
+
105
+ Decision log ({source}):
106
+ \"\"\"
107
+ {text}
108
+ \"\"\"
109
+
110
+ Generate 3-5 Q&A pairs that capture:
111
+ - what the technique is + when to use it + how to wire it
112
+ - expected gain + cost + risks
113
+ - comparable alternatives
114
+
115
+ Output STRICT JSONL only:
116
+ {{"prompt":"...","response":"..."}}""",
117
+
118
+ "conversation": """You are extracting (instruction, expert-response) training pairs from a
119
+ real engineer↔assistant conversation transcript.
120
+
121
+ Transcript chunk ({source}):
122
+ \"\"\"
123
+ {text}
124
+ \"\"\"
125
+
126
+ Find moments where:
127
+ 1. The engineer asked a real technical question or gave a real task
128
+ 2. The assistant gave a useful expert answer (skip filler/clarification turns)
129
+ 3. Skip moments where the assistant made the user angry or got it wrong
130
+ (you can tell from later corrections in the transcript)
131
+
132
+ Generate 3-8 pairs distilling the GOOD moments. Make them self-contained
133
+ (no "as we discussed" β€” restate the context).
134
+
135
+ Output STRICT JSONL only:
136
+ {{"prompt":"...","response":"..."}}""",
137
+
138
+ "memory": """You are converting a personal-knowledge memory file (lessons learned,
139
+ preferences, past mistakes/fixes) into training pairs that bake the lesson
140
+ into the model.
141
+
142
+ Memory file ({source}):
143
+ \"\"\"
144
+ {text}
145
+ \"\"\"
146
+
147
+ Generate 3-6 pairs where:
148
+ - prompt = the kind of question that would trigger the lesson
149
+ - response = the lesson applied correctly (including "don't do X" patterns
150
+ rephrased as "do Y instead because Z")
151
+
152
+ Output STRICT JSONL only:
153
+ {{"prompt":"...","response":"..."}}""",
154
+ }
155
+
156
+
157
+ def _hash(text: str) -> str:
158
+ return hashlib.sha256(text.encode()).hexdigest()[:16]
159
+
160
+
161
+ def _call_api(api_name, url, key_env, model, max_tokens, prompt):
162
+ key = os.environ.get(key_env, "")
163
+ if not key:
164
+ return None, f"no key {key_env}"
165
+ body = {
166
+ "model": model,
167
+ "messages": [{"role": "user", "content": prompt}],
168
+ "max_tokens": max_tokens,
169
+ "temperature": 0.45,
170
+ }
171
+ try:
172
+ req = request.Request(
173
+ url,
174
+ data=json.dumps(body).encode(),
175
+ headers={"Authorization": f"Bearer {key}",
176
+ "Content-Type": "application/json",
177
+ "User-Agent": "surrogate-1-distiller/1.0",
178
+ "Accept": "application/json"},
179
+ )
180
+ with request.urlopen(req, timeout=45) as r:
181
+ d = json.loads(r.read().decode())
182
+ return d["choices"][0]["message"]["content"], None
183
+ except error.HTTPError as e:
184
+ try:
185
+ body_err = e.read().decode()[:300]
186
+ except Exception:
187
+ body_err = ""
188
+ return None, f"{api_name} HTTP {e.code}: {body_err[:200]}"
189
+ except Exception as e:
190
+ return None, f"{api_name} {type(e).__name__}: {str(e)[:200]}"
191
+
192
+
193
+ def call_frontier(prompt: str) -> str | None:
194
+ """Try each API in order; return first success. On 429/rate-limit, skip
195
+ to next API immediately (no sleep β€” they have separate quotas)."""
196
+ errors = []
197
+ for api_name, url, key_env, model, max_tokens in APIS:
198
+ out, err = _call_api(api_name, url, key_env, model, max_tokens, prompt)
199
+ if out:
200
+ return out
201
+ if err:
202
+ errors.append(f"{api_name}: {err[:80]}")
203
+ sys.stderr.write(f"all {len(APIS)} APIs failed:\n " + "\n ".join(errors) + "\n")
204
+ return None
205
+
206
+
207
+ def parse_jsonl(text: str) -> list[dict]:
208
+ """Extract valid JSON pairs from frontier output (it sometimes wraps with
209
+ code fences or adds prose)."""
210
+ pairs = []
211
+ for L in text.splitlines():
212
+ L = L.strip()
213
+ if not L or L.startswith("```") or L.startswith("//"):
214
+ continue
215
+ try:
216
+ j = json.loads(L)
217
+ if isinstance(j, dict) and "prompt" in j and "response" in j:
218
+ pairs.append(j)
219
+ except Exception:
220
+ continue
221
+ # also try multi-line { ... } blocks if line-based failed
222
+ if not pairs:
223
+ for m in re.finditer(r'\{\s*"prompt"\s*:.*?\}\s*(?=\{|\Z)', text, re.S):
224
+ try:
225
+ j = json.loads(m.group(0))
226
+ if "prompt" in j and "response" in j:
227
+ pairs.append(j)
228
+ except Exception:
229
+ continue
230
+ return pairs
231
+
232
+
233
+ def chunk_text(text: str, max_chars: int = 8000, overlap: int = 200) -> list[str]:
234
+ """Split long text into overlapping chunks distillable in one API call."""
235
+ if len(text) <= max_chars:
236
+ return [text]
237
+ chunks = []
238
+ i = 0
239
+ while i < len(text):
240
+ chunk = text[i:i + max_chars]
241
+ # try to break on paragraph boundary near the end
242
+ if i + max_chars < len(text):
243
+ cut = chunk.rfind("\n\n")
244
+ if cut > max_chars // 2:
245
+ chunk = chunk[:cut]
246
+ chunks.append(chunk)
247
+ i += len(chunk) - overlap
248
+ return chunks
249
+
250
+
251
+ def distill_one(text: str, source: str, kind: str,
252
+ seen_hashes: set, out_handle) -> int:
253
+ """Distill one text blob, write pairs to out_handle, return # pairs added."""
254
+ if kind not in PROMPTS:
255
+ sys.stderr.write(f"unknown kind: {kind}\n"); return 0
256
+ chunks = chunk_text(text)
257
+ n_added = 0
258
+ for chunk in chunks:
259
+ prompt = PROMPTS[kind].format(text=chunk, source=source)
260
+ raw = call_frontier(prompt)
261
+ if not raw:
262
+ continue
263
+ pairs = parse_jsonl(raw)
264
+ for p in pairs:
265
+ h = _hash(p["prompt"][:200] + p["response"][:200])
266
+ if h in seen_hashes:
267
+ continue
268
+ seen_hashes.add(h)
269
+ p["source"] = source
270
+ p["kind"] = kind
271
+ out_handle.write(json.dumps(p, ensure_ascii=False) + "\n")
272
+ out_handle.flush()
273
+ n_added += 1
274
+ return n_added
275
+
276
+
277
+ def main() -> int:
278
+ p = argparse.ArgumentParser()
279
+ p.add_argument("--input", help="file or dir; use --stdin for piped input")
280
+ p.add_argument("--stdin", action="store_true")
281
+ p.add_argument("--kind", required=True, choices=list(PROMPTS.keys()))
282
+ p.add_argument("--out", required=True, help="JSONL output path (append)")
283
+ p.add_argument("--limit", type=int, default=0,
284
+ help="cap on # files processed (0=all)")
285
+ p.add_argument("--source-tag", default="",
286
+ help="optional source label override")
287
+ args = p.parse_args()
288
+
289
+ out_path = Path(args.out)
290
+ out_path.parent.mkdir(parents=True, exist_ok=True)
291
+
292
+ # Resume support β€” load existing hashes
293
+ seen = set()
294
+ if out_path.exists():
295
+ for L in out_path.read_text().splitlines():
296
+ try:
297
+ j = json.loads(L)
298
+ seen.add(_hash(j["prompt"][:200] + j["response"][:200]))
299
+ except Exception:
300
+ continue
301
+
302
+ f_out = out_path.open("a")
303
+ n_total = 0
304
+
305
+ if args.stdin:
306
+ text = sys.stdin.read()
307
+ n = distill_one(text, args.source_tag or "stdin", args.kind, seen, f_out)
308
+ n_total += n
309
+ print(f" stdin β†’ +{n} pairs")
310
+ elif Path(args.input).is_file():
311
+ text = Path(args.input).read_text(errors="replace")
312
+ n = distill_one(text, args.source_tag or args.input, args.kind, seen, f_out)
313
+ n_total += n
314
+ print(f" {args.input} β†’ +{n} pairs")
315
+ elif Path(args.input).is_dir():
316
+ files = sorted(Path(args.input).rglob("*.md"))
317
+ if args.limit:
318
+ files = files[:args.limit]
319
+ for i, fp in enumerate(files):
320
+ try:
321
+ text = fp.read_text(errors="replace")
322
+ except Exception as e:
323
+ sys.stderr.write(f" skip {fp}: {e}\n")
324
+ continue
325
+ if len(text) < 100:
326
+ continue
327
+ n = distill_one(text, str(fp), args.kind, seen, f_out)
328
+ n_total += n
329
+ print(f" [{i+1}/{len(files)}] {fp.name} β†’ +{n} (total {n_total})")
330
+ else:
331
+ sys.stderr.write(f"--input not found: {args.input}\n")
332
+ return 2
333
+
334
+ f_out.close()
335
+ print(f"DONE: total {n_total} pairs written to {out_path}")
336
+ return 0
337
+
338
+
339
+ if __name__ == "__main__":
340
+ sys.exit(main())
bin/v3/feature-extractor.py ADDED
@@ -0,0 +1,175 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """Surrogate-1 V10 β€” feature-request extractor.
3
+
4
+ Read all 748 past Claude conversation .jsonl files in ~/.claude/projects/
5
+ and extract every moment where the user asked Claude to BUILD a feature
6
+ in Surrogate. These become FEATURE-DEMONSTRATION training pairs that
7
+ get distilled into Surrogate's weights β€” so the model itself knows how
8
+ to build the feature when asked again.
9
+
10
+ Format of output JSONL (consumed by trainer via merge_external):
11
+ {
12
+ "prompt": "<user feature request, made self-contained>",
13
+ "response": "<concrete implementation: code/diff/spec/etc>",
14
+ "kind": "feature-build",
15
+ "source": "<session-id>",
16
+ "feature_class": "<heuristic-tag>"
17
+ }
18
+
19
+ Heuristic feature-class tags pulled from the user's actual asks:
20
+ - "agent-coding-loop" β€” autonomous coding agent / dev chain
21
+ - "monitoring" β€” 24/7 monitor, watchdog, observability
22
+ - "training-technique" β€” new technique to add to trainer (RL, DPO, etc.)
23
+ - "orchestration" β€” multi-agent, spawn/aggregate, parallel
24
+ - "knowledge-ingest" β€” distill X into model
25
+ - "self-improvement" β€” closed-loop improvement, flywheel
26
+ - "release-pipeline" β€” autonomous release, CI/CD, draft PR
27
+ - "incident-response" β€” auto-heal, diagnose, fix
28
+ - "evaluation" β€” bench, scoring, verdict
29
+ - "cost-efficiency" β€” frontier-style smarter-with-less
30
+ """
31
+ from __future__ import annotations
32
+
33
+ import argparse
34
+ import json
35
+ import re
36
+ import sys
37
+ from pathlib import Path
38
+
39
+ PROJECTS = Path.home() / ".claude/projects"
40
+
41
+ # Heuristic patterns β€” match Thai + English request-style language
42
+ REQUEST_PATTERNS = [
43
+ # Direct imperative "build/make/add/create X"
44
+ r"\b(build|make|add|create|implement|wire|train|bake|ingest|distill)\s+\w+",
45
+ # Thai imperatives
46
+ r"(ΰΈ—ΰΈ³|ΰΈͺΰΈ£ΰΉ‰ΰΈ²ΰΈ‡|ΰΉ€ΰΈžΰΈ΄ΰΉˆΰΈ‘|implement|train|ΰΉ€ΰΈ—ΰΈ£ΰΈ™|fine-?tune|fine_tune)\s+",
47
+ # Feature-shaped asks
48
+ r"feature\s+",
49
+ r"(can|please|let|let's|how about)\s+",
50
+ # explicit "I want X" / "ผฑอฒาก"
51
+ r"(I want|I need|I'd like|ผฑอฒาก|ΰΈœΰΈ‘ΰΈ•ΰΉ‰ΰΈ­ΰΈ‡ΰΈΰΈ²ΰΈ£|อฒาก|ต้องการ)",
52
+ # research-asks that lead to features
53
+ r"(go research|research|find out|ΰΈ«ΰΈ²ΰΈ‘ΰΈ²|ΰΈ₯ΰΈ­ΰΈ‡ΰΈ«ΰΈ²|วิจัฒ)",
54
+ ]
55
+
56
+ FEATURE_CLASS_KEYWORDS = {
57
+ "agent-coding-loop": ["agent chain", "dev agent", "autonomous coding", "code 24", "spawn agent", "sub-?agent", "team agent"],
58
+ "monitoring": ["monitor", "watchdog", "observability", "smoke", "health check", "anomaly", "incident detect"],
59
+ "training-technique": ["GRPO", "DPO", "DAPO", "ORPO", "KTO", "RLVR", "SimPO", "Constitutional AI", "TruthRL", "Mask-DPO", "PiSSA", "LoftQ", "CorDA", "DoRA", "RSLoRA", "NEFTune", "Spectrum", "Quiet-STaR", "Reflexion", "Voyager", "Magpie", "self-rewarding", "knowledge distillation", "DistillKit", "active learning", "CoT", "PRM", "verify"],
60
+ "orchestration": ["multi-agent", "spawn", "aggregate", "parallel", "orchestrat", "team", "shared context", "subagent", "dispatch"],
61
+ "knowledge-ingest": ["ingest", "distill", "corpus", "training data", "training pair", "Q&A", "vault", "obsidian", "memory"],
62
+ "self-improvement": ["self-improve", "flywheel", "improve over time", "ΰΉ€ΰΈΰΉˆΰΈ‡ΰΈ‚ΰΈΆΰΉ‰ΰΈ™", "online RL", "continual learn"],
63
+ "release-pipeline": ["release", "draft PR", "CI/CD", "deploy", "MVP", "v1 v2", "v10000", "ship feature", "auto-release"],
64
+ "incident-response": ["incident", "auto-heal", "rollback", "diagnose", "patch", "remediate", "fix"],
65
+ "evaluation": ["bench", "eval", "score", "verdict", "rubric", "HumanEval", "MBPP", "BFCL", "SWE-Bench", "test", "ΰΈ§ΰΈ±ΰΈ”ΰΈœΰΈ₯"],
66
+ "cost-efficiency": ["smarter with less", "frontier", "efficiency", "speculative", "MoE", "sliding window", "test-time compute", "ΰΉƒΰΈŠΰΉ‰ resource ΰΈ™ΰΉ‰ΰΈ­ΰΈ’"],
67
+ "role-persona": ["SRE", "DevSecOps", "Full Stack", "PM", "PO", "SA", "BD", "QE", "AI Engineer", "Marketing", "role", "persona", "engineer"],
68
+ "long-context": ["long context", "32K", "64K", "128K", "1M", "YaRN", "RoPE", "context window"],
69
+ "anti-hallucination": ["halluc", "ΰΈ«ΰΈ₯ΰΈ­ΰΈ™", "factual", "truth", "correctness", "F-DPO", "TruthRL", "calibration"],
70
+ }
71
+
72
+
73
+ def classify(text: str) -> str:
74
+ text_lower = text.lower()
75
+ scores = {}
76
+ for cls, kws in FEATURE_CLASS_KEYWORDS.items():
77
+ for kw in kws:
78
+ if kw.lower() in text_lower:
79
+ scores[cls] = scores.get(cls, 0) + 1
80
+ if not scores:
81
+ return "general"
82
+ return max(scores, key=scores.get)
83
+
84
+
85
+ def looks_like_feature_request(user_msg: str) -> bool:
86
+ """True if the user message is asking for something to be built."""
87
+ if len(user_msg) < 30:
88
+ return False
89
+ if len(user_msg) > 4000:
90
+ # Very long messages are usually critique/anger, not feature requests
91
+ # but they often contain feature requests inside
92
+ pass
93
+ for pat in REQUEST_PATTERNS:
94
+ if re.search(pat, user_msg, re.I):
95
+ return True
96
+ return False
97
+
98
+
99
+ def extract_pairs_from_session(jsonl_path: Path) -> list[dict]:
100
+ """For each (user, assistant) pair where the user asked for a feature,
101
+ return a training-ready dict."""
102
+ out = []
103
+ last_user = None
104
+ for L in jsonl_path.read_text(errors="replace").splitlines():
105
+ try:
106
+ ev = json.loads(L)
107
+ except Exception:
108
+ continue
109
+ msg = ev.get("message", {})
110
+ role = msg.get("role")
111
+ content = msg.get("content", "")
112
+ if isinstance(content, list):
113
+ content = "\n".join(b.get("text", "") for b in content
114
+ if isinstance(b, dict) and b.get("type") == "text")
115
+ if not isinstance(content, str):
116
+ content = str(content)
117
+ if role == "user":
118
+ last_user = content
119
+ elif role == "assistant" and last_user:
120
+ if looks_like_feature_request(last_user):
121
+ # Make prompt self-contained β€” restate context briefly
122
+ # Skip if assistant response is too short or just clarification
123
+ if len(content) >= 100 and not content.startswith("?"):
124
+ out.append({
125
+ "prompt": last_user[:3000],
126
+ "response": content[:6000],
127
+ "kind": "feature-build",
128
+ "source": jsonl_path.stem,
129
+ "feature_class": classify(last_user + " " + content),
130
+ })
131
+ last_user = None
132
+ return out
133
+
134
+
135
+ def main() -> int:
136
+ p = argparse.ArgumentParser()
137
+ p.add_argument("--out", default=str(Path.home() / ".surrogate/state/v10-ingest/conversations/feature-builds.jsonl"))
138
+ p.add_argument("--limit", type=int, default=0)
139
+ args = p.parse_args()
140
+
141
+ out_path = Path(args.out)
142
+ out_path.parent.mkdir(parents=True, exist_ok=True)
143
+ f_out = out_path.open("w")
144
+ n_total = 0
145
+ n_files = 0
146
+ by_class = {}
147
+ files = sorted(PROJECTS.rglob("*.jsonl"))
148
+ if args.limit:
149
+ files = files[:args.limit]
150
+ print(f"scanning {len(files)} session files for feature-build pairs...")
151
+ for fp in files:
152
+ try:
153
+ pairs = extract_pairs_from_session(fp)
154
+ except Exception as e:
155
+ sys.stderr.write(f" skip {fp.name}: {e}\n")
156
+ continue
157
+ if not pairs:
158
+ continue
159
+ n_files += 1
160
+ for j in pairs:
161
+ f_out.write(json.dumps(j, ensure_ascii=False) + "\n")
162
+ by_class[j["feature_class"]] = by_class.get(j["feature_class"], 0) + 1
163
+ n_total += 1
164
+ if n_files % 50 == 0:
165
+ print(f" scanned {n_files} sessions, extracted {n_total} feature-build pairs")
166
+ f_out.close()
167
+ print(f"\nDONE: {n_total} feature-build pairs from {n_files} sessions β†’ {out_path}")
168
+ print("By feature class:")
169
+ for cls, n in sorted(by_class.items(), key=lambda x: -x[1]):
170
+ print(f" {n:5d} {cls}")
171
+ return 0
172
+
173
+
174
+ if __name__ == "__main__":
175
+ sys.exit(main())
bin/v3/ingest-all.sh ADDED
@@ -0,0 +1,180 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env bash
2
+ # Surrogate-1 V10 β€” orchestrate ALL ingestion sources β†’ distill β†’ push HF.
3
+ #
4
+ # Sources (audit found 715+ artifacts, 0% had been ingested):
5
+ # - Obsidian Vault 591 .md files
6
+ # - .claude/memory 27 files
7
+ # - SKILL.md 68 files
8
+ # - Claude agents 25 files (6 user + 19 plugin)
9
+ # - arkship decisions 31 files
10
+ # - past conversations 748 jsonl files = 647 MB of dialogue
11
+ #
12
+ # Strategy: each source kind β†’ its own JSONL β†’ its own HF dataset
13
+ # (so V10 trainer can weight them differently via merge_external).
14
+ #
15
+ # Free-tier API rotation handles rate limits across Cerebras+Groq+OpenRouter.
16
+ # All output is checkpointed (resume-friendly): re-running won't redo done work.
17
+ set -uo pipefail
18
+ [[ -f "$HOME/.hermes/.env" ]] && { set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a; }
19
+
20
+ V3="$HOME/.surrogate/hf-space/bin/v3"
21
+ ING="$HOME/.surrogate/state/v10-ingest"
22
+ LOG="$HOME/.surrogate/logs/ingest-all.log"
23
+ mkdir -p "$ING" "$(dirname "$LOG")"
24
+
25
+ log() { echo "[$(date '+%Y-%m-%dT%H:%M:%S')] $*" | tee -a "$LOG"; }
26
+
27
+ notify() {
28
+ [[ -z "${DISCORD_WEBHOOK:-}" ]] && return
29
+ curl -s -X POST -H "Content-Type: application/json" \
30
+ -d "$(python3 -c "import json,sys; print(json.dumps({'content': sys.argv[1]}))" "$1")" \
31
+ "$DISCORD_WEBHOOK" >/dev/null 2>&1 || true
32
+ }
33
+
34
+ ingest_dir() {
35
+ local src_dir="$1" kind="$2" out_jsonl="$3" repo="$4" limit="${5:-0}"
36
+ log ""
37
+ log "── INGEST: $kind from $src_dir ──"
38
+ [[ ! -d "$src_dir" ]] && { log " skip: $src_dir not a dir"; return 0; }
39
+ local extra=""
40
+ [[ "$limit" != "0" ]] && extra="--limit $limit"
41
+ python3 "$V3/distiller.py" --input "$src_dir" --kind "$kind" \
42
+ --out "$out_jsonl" $extra 2>&1 | tee -a "$LOG" | tail -10
43
+ if [[ -s "$out_jsonl" ]]; then
44
+ local n; n=$(wc -l < "$out_jsonl" | tr -d ' ')
45
+ log " β†’ $n pairs in $out_jsonl"
46
+ if (( n > 50 )); then
47
+ log " pushing β†’ $repo"
48
+ python3 "$V3/push-hf-dataset.py" --input "$out_jsonl" --repo "$repo" 2>&1 | tail -5 | tee -a "$LOG"
49
+ fi
50
+ fi
51
+ }
52
+
53
+ ingest_file() {
54
+ local src="$1" kind="$2" out_jsonl="$3" repo="$4"
55
+ [[ ! -f "$src" ]] && return 0
56
+ log ""
57
+ log "── INGEST FILE: $kind from $src ──"
58
+ python3 "$V3/distiller.py" --input "$src" --kind "$kind" \
59
+ --out "$out_jsonl" --source-tag "$src" 2>&1 | tail -3
60
+ [[ -s "$out_jsonl" ]] && python3 "$V3/push-hf-dataset.py" \
61
+ --input "$out_jsonl" --repo "$repo" 2>&1 | tail -3
62
+ }
63
+
64
+ log "═══ V10 INGEST PIPELINE STARTING ═══"
65
+ notify "πŸ” V10 ingest pipeline starting β€” 715+ artifacts β†’ 7 HF datasets"
66
+
67
+ # ── 1. Conversations: extract first, then distill ──────────────────────────
68
+ log ""
69
+ log "── STEP 1: extract past conversations ──"
70
+ python3 "$V3/conversation-extractor.py" --limit 0 --min-pairs 3 2>&1 | tail -5 | tee -a "$LOG"
71
+ log "── distill conversations ──"
72
+ ingest_dir "$ING/conversations/text" "conversation" \
73
+ "$ING/conversations/pairs.jsonl" \
74
+ "axentx/surrogate-1-conversations" 0
75
+
76
+ # ── 2. Obsidian Vault knowledge ─────────────────────────────────────────────
77
+ ingest_dir "$HOME/Documents/Obsidian Vault/AI-Hub/knowledge" "knowledge" \
78
+ "$ING/vault/knowledge.jsonl" \
79
+ "axentx/surrogate-1-knowledge-vault" 0
80
+
81
+ ingest_dir "$HOME/Documents/Obsidian Vault/AI-Hub/patterns" "knowledge" \
82
+ "$ING/vault/patterns.jsonl" \
83
+ "axentx/surrogate-1-knowledge-patterns" 0
84
+
85
+ # ── 3. .claude/memory ───────────────────────────────────────────────────────
86
+ ingest_dir "$HOME/.claude/memory" "memory" \
87
+ "$ING/memory/lessons.jsonl" \
88
+ "axentx/surrogate-1-knowledge-memory" 0
89
+
90
+ # ── 4. SKILL.md mirror (Anthropic + community) ──────────────────────────────
91
+ log ""
92
+ log "── INGEST: skills (Anthropic + community SKILL.md) ──"
93
+ mkdir -p "$ING/skills"
94
+ find "$HOME/Documents/Obsidian Vault/AI-Hub/skills" \
95
+ "$HOME/.claude/skills" \
96
+ "$HOME/.claude/plugins/cache" \
97
+ -name "SKILL.md" -type f 2>/dev/null \
98
+ | while read -r f; do
99
+ python3 "$V3/distiller.py" --input "$f" --kind "skill" \
100
+ --out "$ING/skills/skills.jsonl" --source-tag "$f" 2>&1 | tail -2
101
+ done | tee -a "$LOG" | tail -20
102
+ [[ -s "$ING/skills/skills.jsonl" ]] && python3 "$V3/push-hf-dataset.py" \
103
+ --input "$ING/skills/skills.jsonl" \
104
+ --repo "axentx/surrogate-1-skills-mirror" 2>&1 | tail -3
105
+
106
+ # ── 5. Claude agents (user + plugin) ────────────────────────────────────────
107
+ log ""
108
+ log "── INGEST: agents (user + plugin) ──"
109
+ mkdir -p "$ING/agents"
110
+ {
111
+ find "$HOME/.claude/agents" -name "*.md" -type f 2>/dev/null
112
+ find "$HOME/.claude/plugins" -path "*agents*" -name "*.md" -type f 2>/dev/null
113
+ } | sort -u | while read -r f; do
114
+ python3 "$V3/distiller.py" --input "$f" --kind "agent" \
115
+ --out "$ING/agents/personas.jsonl" --source-tag "$f" 2>&1 | tail -2
116
+ done | tee -a "$LOG" | tail -10
117
+ [[ -s "$ING/agents/personas.jsonl" ]] && python3 "$V3/push-hf-dataset.py" \
118
+ --input "$ING/agents/personas.jsonl" \
119
+ --repo "axentx/surrogate-1-roles-claude-builtin" 2>&1 | tail -3
120
+
121
+ # ── 6. arkship decisions (technique research) ───────────────────────────────
122
+ ingest_dir "$HOME/axentx/arkship/decisions" "decision" \
123
+ "$ING/decisions/techniques.jsonl" \
124
+ "axentx/surrogate-1-arkship-decisions" 0
125
+
126
+ # ── 7. all axentx project decisions (Costinel/Vanguard/etc) ────────────────
127
+ log ""
128
+ log "── INGEST: axentx project decisions (cross-project) ──"
129
+ mkdir -p "$ING/decisions"
130
+ find "$HOME/axentx" -maxdepth 4 -name "decisions" -type d 2>/dev/null \
131
+ | while read -r d; do
132
+ log " scanning $d"
133
+ python3 "$V3/distiller.py" --input "$d" --kind "decision" \
134
+ --out "$ING/decisions/all-axentx.jsonl" --source-tag "$d" 2>&1 | tail -3
135
+ done | tee -a "$LOG" | tail -20
136
+ [[ -s "$ING/decisions/all-axentx.jsonl" ]] && python3 "$V3/push-hf-dataset.py" \
137
+ --input "$ING/decisions/all-axentx.jsonl" \
138
+ --repo "axentx/surrogate-1-axentx-decisions" 2>&1 | tail -3
139
+
140
+ # ── 8. surrogate trainer history (160 commits) β†’ meta-learning corpus ──────
141
+ log ""
142
+ log "── INGEST: hf-space self-development history ──"
143
+ mkdir -p "$ING/self-trace"
144
+ cd "$HOME/.surrogate/hf-space" 2>/dev/null && {
145
+ git log --pretty=format:"%H%n%an%n%aI%n%s%n%n%b%n---END---" \
146
+ --since="60 days ago" --no-merges \
147
+ > "$ING/self-trace/hf-space-history.txt" 2>/dev/null
148
+ log " hf-space history dumped: $(wc -l < "$ING/self-trace/hf-space-history.txt" | tr -d ' ') lines"
149
+ }
150
+ [[ -s "$ING/self-trace/hf-space-history.txt" ]] && python3 "$V3/distiller.py" \
151
+ --input "$ING/self-trace/hf-space-history.txt" \
152
+ --kind "decision" \
153
+ --out "$ING/self-trace/dev-trace.jsonl" \
154
+ --source-tag "hf-space-git-history" 2>&1 | tail -5
155
+
156
+ # ── final summary + push merged ──────────────────────────────────────────────
157
+ log ""
158
+ log "═══ V10 INGEST SUMMARY ═══"
159
+ total=0
160
+ for f in "$ING"/*/*.jsonl; do
161
+ [[ -s "$f" ]] || continue
162
+ n=$(wc -l < "$f" | tr -d ' ')
163
+ total=$((total + n))
164
+ log " $(basename $(dirname "$f"))/$(basename "$f"): $n pairs"
165
+ done
166
+ log " ────────────────────────"
167
+ log " TOTAL: $total pairs across all sinks"
168
+
169
+ # build merged for trainer convenience
170
+ mkdir -p "$ING/merged"
171
+ cat "$ING"/*/*.jsonl 2>/dev/null > "$ING/merged/all.jsonl" 2>/dev/null
172
+ if [[ -s "$ING/merged/all.jsonl" ]]; then
173
+ log " merged total: $(wc -l < "$ING/merged/all.jsonl" | tr -d ' ') pairs"
174
+ python3 "$V3/push-hf-dataset.py" \
175
+ --input "$ING/merged/all.jsonl" \
176
+ --repo "axentx/surrogate-1-v10-merged" 2>&1 | tail -3
177
+ fi
178
+
179
+ notify "βœ… V10 ingest done β€” $total pairs across 7 HF datasets, ready for V10 trainer"
180
+ log "═══ DONE ═══"
bin/v3/kaggle-ingest-kernel.py ADDED
@@ -0,0 +1,373 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """Surrogate-1 V10 β€” Kaggle ingest kernel.
3
+
4
+ Runs ENTIRELY on Kaggle (T4Γ—2 free). NO Mac compute.
5
+
6
+ Pipeline:
7
+ 1. Pull source bundle from axentx/surrogate-1-v10-source-bundle (tar.gz)
8
+ 2. Extract per-kind source files (vault, memory, skills, agents, decisions,
9
+ conversations, feature-builds)
10
+ 3. Distill each via Cerebras β†’ Groq β†’ OpenRouter free chain
11
+ 4. Push results to 7+ axentx/surrogate-1-knowledge-* HF datasets
12
+
13
+ Setup in Kaggle UI:
14
+ - Add-ons β†’ Secrets β†’ attach: HF_TOKEN, CEREBRAS_API_KEY, GROQ_API_KEY,
15
+ OPENROUTER_API_KEY (all already attached from V8 trainer)
16
+
17
+ Outputs (these axentx/* datasets will be CREATED if missing β€” fixes the
18
+ 404s the audit found):
19
+ axentx/surrogate-1-knowledge-vault ← Obsidian Vault distilled Q&A
20
+ axentx/surrogate-1-knowledge-memory ← .claude/memory lessons
21
+ axentx/surrogate-1-knowledge-patterns ← Vault/patterns/
22
+ axentx/surrogate-1-skills-mirror ← 68 SKILL.md demonstrations
23
+ axentx/surrogate-1-roles-claude-builtin ← 25 agent role personas
24
+ axentx/surrogate-1-arkship-decisions ← 31 technique research
25
+ axentx/surrogate-1-axentx-decisions ← cross-project decisions
26
+ axentx/surrogate-1-conversations ← past conversation Q&A
27
+ axentx/surrogate-1-feature-builds ← extracted feature requests
28
+ + their expert build responses
29
+
30
+ Pickup by V10 trainer: kaggle-trainer.sh's merge_external() calls already
31
+ support these dataset names β€” no trainer changes needed.
32
+
33
+ Usage:
34
+ Save Version on Kaggle. Run. Output appears in HF datasets.
35
+ """
36
+ from __future__ import annotations
37
+
38
+ import os
39
+ import json
40
+ import re
41
+ import sys
42
+ import time
43
+ import tarfile
44
+ import hashlib
45
+ import subprocess
46
+ from pathlib import Path
47
+ from urllib import request, error
48
+
49
+ # ── Bootstrap Kaggle Secrets β†’ env (Kaggle Secrets aren't auto-injected) ────
50
+ try:
51
+ from kaggle_secrets import UserSecretsClient
52
+ _us = UserSecretsClient()
53
+ for k in ("HF_TOKEN", "CEREBRAS_API_KEY", "GROQ_API_KEY",
54
+ "OPENROUTER_API_KEY", "OPENROUTER_API_KEY_2", "ANTHROPIC_API_KEY"):
55
+ if k not in os.environ:
56
+ try: os.environ[k] = _us.get_secret(k)
57
+ except Exception: pass
58
+ print("Kaggle secrets bootstrapped:",
59
+ [k for k in ("HF_TOKEN","CEREBRAS_API_KEY","GROQ_API_KEY",
60
+ "OPENROUTER_API_KEY") if os.environ.get(k)])
61
+ except ImportError:
62
+ print("not on Kaggle β€” env must come from shell")
63
+
64
+ # Install deps
65
+ print("Installing deps...")
66
+ subprocess.run([sys.executable, "-m", "pip", "install", "-q",
67
+ "huggingface_hub", "datasets"], check=False)
68
+
69
+ from huggingface_hub import HfApi, create_repo, snapshot_download
70
+
71
+ WORK = Path("/kaggle/working/v10-ingest")
72
+ WORK.mkdir(parents=True, exist_ok=True)
73
+ SRC_BUNDLE_REPO = "axentx/surrogate-1-v10-source-bundle"
74
+ SRC_BUNDLE_FILE = "bundle.tar.gz"
75
+
76
+ # ── Free-tier API rotation (User-Agent fix to avoid 403) ────────────────────
77
+ APIS = [
78
+ ("groq-llama", "https://api.groq.com/openai/v1/chat/completions",
79
+ "GROQ_API_KEY", "llama-3.3-70b-versatile"),
80
+ ("groq-qwen", "https://api.groq.com/openai/v1/chat/completions",
81
+ "GROQ_API_KEY", "qwen/qwen3-32b"),
82
+ ("groq-gpt", "https://api.groq.com/openai/v1/chat/completions",
83
+ "GROQ_API_KEY", "openai/gpt-oss-120b"),
84
+ ("cerebras-qwen", "https://api.cerebras.ai/v1/chat/completions",
85
+ "CEREBRAS_API_KEY", "qwen-3-235b-a22b-instruct-2507"),
86
+ ("cerebras-gpt", "https://api.cerebras.ai/v1/chat/completions",
87
+ "CEREBRAS_API_KEY", "gpt-oss-120b"),
88
+ ("openrouter1", "https://openrouter.ai/api/v1/chat/completions",
89
+ "OPENROUTER_API_KEY", "meta-llama/llama-3.3-70b-instruct:free"),
90
+ ("openrouter2", "https://openrouter.ai/api/v1/chat/completions",
91
+ "OPENROUTER_API_KEY_2", "deepseek/deepseek-chat-v3.1:free"),
92
+ ]
93
+
94
+ PROMPTS = {
95
+ "knowledge": (
96
+ "Distill the document into 4-8 instruction/response training pairs for a "
97
+ "senior code+SRE+DevSecOps LLM (Surrogate-1).\n"
98
+ "Source ({source}):\n```\n{text}\n```\n"
99
+ "Each pair: realistic engineer question + expert response with real APIs/CLIs/standards. "
100
+ "100-500 words/response. NO `<thinking>` blocks. NO hedging. "
101
+ "Output ONLY valid JSONL, one per line: "
102
+ '{{"prompt":"...","response":"..."}}'
103
+ ),
104
+ "memory": (
105
+ "Convert this lessons-learned/preferences/past-mistake memory file into 3-6 training pairs "
106
+ "that BAKE the lesson into the model.\n"
107
+ "Source ({source}):\n```\n{text}\n```\n"
108
+ "Each pair: prompt = the kind of question that would trigger the lesson. "
109
+ "response = the lesson applied correctly. "
110
+ "Output ONLY valid JSONL: "
111
+ '{{"prompt":"...","response":"..."}}'
112
+ ),
113
+ "skill": (
114
+ "Convert this SKILL.md into 5-8 skill-demonstration training pairs.\n"
115
+ "Source ({source}):\n```\n{text}\n```\n"
116
+ "Each pair: prompt = realistic user request invoking the skill. "
117
+ "response = complete expert demonstration in the skill's specified format. "
118
+ "Output ONLY valid JSONL: "
119
+ '{{"prompt":"...","response":"..."}}'
120
+ ),
121
+ "agent": (
122
+ "Convert this Claude agent definition into 8-12 role-persona training pairs.\n"
123
+ "Source ({source}):\n```\n{text}\n```\n"
124
+ "Each pair: prompt = realistic task delegated to this agent. "
125
+ "response = agent's expert response IN-ROLE using its specified output format. "
126
+ "Output ONLY valid JSONL: "
127
+ '{{"prompt":"...","response":"..."}}'
128
+ ),
129
+ "decision": (
130
+ "Extract technique knowledge from this AI research decision log into 3-5 Q&A pairs.\n"
131
+ "Source ({source}):\n```\n{text}\n```\n"
132
+ "Capture: what + when to use + how to wire + expected gain + cost + alternatives. "
133
+ "Output ONLY valid JSONL: "
134
+ '{{"prompt":"...","response":"..."}}'
135
+ ),
136
+ "conversation": (
137
+ "From this engineer↔assistant conversation transcript, extract 3-8 (instruction, "
138
+ "expert-response) pairs from the GOOD moments (skip filler, skip wrong-answers "
139
+ "the user later corrected). Make pairs self-contained.\n"
140
+ "Source ({source}):\n```\n{text}\n```\n"
141
+ "Output ONLY valid JSONL: "
142
+ '{{"prompt":"...","response":"..."}}'
143
+ ),
144
+ "feature-build": (
145
+ "The user asked Claude to BUILD a feature in Surrogate-1 model. Convert this "
146
+ "request β†’ expert-implementation pair into a FEATURE-BUILD training pair so the "
147
+ "model itself learns to handle similar requests.\n"
148
+ "User request: ```{text}```\n"
149
+ "Source ({source})\n"
150
+ "Output ONE valid JSON: "
151
+ '{{"prompt":"<self-contained restatement of the build request>",'
152
+ '"response":"<concrete expert build: code/diff/spec/architecture>"}}'
153
+ ),
154
+ }
155
+
156
+
157
+ def call_api(name, url, key_env, model, prompt, max_tokens=4000, timeout=45):
158
+ key = os.environ.get(key_env, "")
159
+ if not key:
160
+ return None, f"no {key_env}"
161
+ try:
162
+ req = request.Request(
163
+ url,
164
+ data=json.dumps({
165
+ "model": model,
166
+ "messages": [{"role": "user", "content": prompt}],
167
+ "max_tokens": max_tokens,
168
+ "temperature": 0.4,
169
+ }).encode(),
170
+ headers={"Authorization": f"Bearer {key}",
171
+ "Content-Type": "application/json",
172
+ "User-Agent": "surrogate-1-distiller/1.0",
173
+ "Accept": "application/json"},
174
+ )
175
+ with request.urlopen(req, timeout=timeout) as r:
176
+ d = json.loads(r.read().decode())
177
+ return d["choices"][0]["message"]["content"], None
178
+ except error.HTTPError as e:
179
+ try: msg = e.read().decode()[:200]
180
+ except Exception: msg = ""
181
+ return None, f"{name} {e.code}: {msg}"
182
+ except Exception as e:
183
+ return None, f"{name} {type(e).__name__}: {str(e)[:120]}"
184
+
185
+
186
+ def call_frontier(prompt: str) -> str | None:
187
+ errs = []
188
+ for name, url, key_env, model in APIS:
189
+ out, err = call_api(name, url, key_env, model, prompt)
190
+ if out:
191
+ return out
192
+ if err: errs.append(err[:90])
193
+ print(" frontier all-fail:", " | ".join(errs[:3]))
194
+ return None
195
+
196
+
197
+ def parse_jsonl(text: str) -> list[dict]:
198
+ pairs = []
199
+ for L in text.splitlines():
200
+ L = L.strip()
201
+ if not L or L.startswith("```") or L.startswith("//"):
202
+ continue
203
+ try:
204
+ j = json.loads(L)
205
+ if isinstance(j, dict) and "prompt" in j and "response" in j:
206
+ pairs.append(j)
207
+ except Exception:
208
+ continue
209
+ if not pairs:
210
+ for m in re.finditer(r'\{\s*"prompt"\s*:.*?\}\s*(?=\{|\Z)', text, re.S):
211
+ try:
212
+ j = json.loads(m.group(0))
213
+ if "prompt" in j and "response" in j:
214
+ pairs.append(j)
215
+ except Exception:
216
+ continue
217
+ return pairs
218
+
219
+
220
+ def chunk_text(text: str, max_chars=8000) -> list[str]:
+     if len(text) <= max_chars: return [text]
+     chunks, i = [], 0
+     while i < len(text):
+         chunk = text[i:i + max_chars]
+         if i + max_chars < len(text):
+             cut = chunk.rfind("\n\n")
+             if cut > max_chars // 2:
+                 chunk = chunk[:cut]
+         chunks.append(chunk)
+         i += len(chunk)
+     return chunks
+
+
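+ # Distill one source directory into Q&A pairs. Resumable: pairs already in
+ # out_path are re-hashed into `seen` and output is opened in append mode, so a
+ # re-run skips everything it has already written.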
+ def distill_dir(src_dir: Path, kind: str, out_path: Path):
+     print(f"\n══ distill {kind} from {src_dir} ══")
+     if not src_dir.exists():
+         print(f" skip: {src_dir} missing"); return 0
+     files = sorted(src_dir.rglob("*.md"))
+     if not files: files = sorted(src_dir.rglob("*"))
+     files = [f for f in files if f.is_file()]
+     print(f" found {len(files)} files")
+
+     seen = set()
+     if out_path.exists():
+         with out_path.open() as f:
+             for line in f:
+                 try:
+                     j = json.loads(line)
+                     h = hashlib.sha256((j["prompt"][:200] + j["response"][:200]).encode()).hexdigest()[:16]
+                     seen.add(h)
+                 except Exception: continue
+
+     f_out = out_path.open("a")
+     n_pairs = 0
+     for i, fp in enumerate(files, 1):
+         try:
+             text = fp.read_text(errors="replace")
+         except Exception:
+             continue
+         if len(text) < 100: continue
+         for chunk in chunk_text(text):
+             prompt = PROMPTS[kind].format(text=chunk, source=str(fp))
+             raw = call_frontier(prompt)
+             if not raw: continue
+             for j in parse_jsonl(raw):
+                 h = hashlib.sha256((j["prompt"][:200] + j["response"][:200]).encode()).hexdigest()[:16]
+                 if h in seen: continue
+                 seen.add(h)
+                 j["source"] = str(fp); j["kind"] = kind
+                 f_out.write(json.dumps(j, ensure_ascii=False) + "\n")
+                 f_out.flush()
+                 n_pairs += 1
+         if i % 10 == 0:
+             print(f" [{i}/{len(files)}] total pairs: {n_pairs}")
+     f_out.close()
+     print(f" ✓ {kind}: {n_pairs} pairs → {out_path}")
+     return n_pairs
+
+
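+ # Upload one JSONL as train.jsonl; exist_ok=True means re-runs reuse the repo
+ # instead of failing on "already exists".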
+ def push_dataset(jsonl_path: Path, repo: str):
+     if not jsonl_path.exists() or jsonl_path.stat().st_size == 0:
+         print(f" skip empty {jsonl_path}"); return
+     api = HfApi(token=os.environ["HF_TOKEN"])
+     try:
+         create_repo(repo, repo_type="dataset", exist_ok=True, private=False,
+                     token=os.environ["HF_TOKEN"])
+     except Exception as e:
+         print(f" create_repo: {e}")
+     api.upload_file(path_or_fileobj=str(jsonl_path),
+                     path_in_repo="train.jsonl",
+                     repo_id=repo, repo_type="dataset",
+                     commit_message="surrogate-1 V10 ingest from Kaggle")
+     n = sum(1 for _ in jsonl_path.open())
+     print(f" ✓ pushed {n} pairs → https://huggingface.co/datasets/{repo}")
+
+
+ # ── MAIN ────────────────────────────────────────────────────────────────────
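+ # Flow: pull source bundle → untar → distill each kind to its own JSONL →
+ # push per-kind datasets plus one merged convenience dataset.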
+ def main():
+     print("\n═══ V10 INGEST KERNEL — Kaggle T4×2 ═══\n")
+
+     # 1. Pull source bundle
+     print("Step 1: pull source bundle from HF...")
+     bundle_dir = WORK / "bundle"
+     try:
+         snapshot_download(repo_id=SRC_BUNDLE_REPO, repo_type="dataset",
+                           local_dir=str(bundle_dir),
+                           token=os.environ.get("HF_TOKEN"))
+     except Exception as e:
+         print(f" ✗ pull bundle failed: {e}")
+         print(f" → run bundle-and-upload.sh on Mac first to push {SRC_BUNDLE_REPO}")
+         return 1
+
+     bundle_tgz = bundle_dir / SRC_BUNDLE_FILE
+     if not bundle_tgz.exists():
+         print(f" ✗ {bundle_tgz} missing in pulled snapshot")
+         return 1
+
+     # 2. Extract
+     extract_dir = WORK / "src"
+     extract_dir.mkdir(exist_ok=True)
+     print(f"\nStep 2: extract {bundle_tgz} → {extract_dir}")
+     with tarfile.open(str(bundle_tgz), "r:gz") as tf:
+         tf.extractall(str(extract_dir))
+     for sub in extract_dir.iterdir():
+         if sub.is_dir():
+             n = sum(1 for p in sub.rglob("*") if p.is_file())
+             print(f" {sub.name}: {n} files")
+
+     # 3. Distill each source kind → its own JSONL → push to HF
+     pipeline = [
+         ("vault", "knowledge", "axentx/surrogate-1-knowledge-vault"),
+         ("patterns", "knowledge", "axentx/surrogate-1-knowledge-patterns"),
+         ("memory", "memory", "axentx/surrogate-1-knowledge-memory"),
+         ("skills", "skill", "axentx/surrogate-1-skills-mirror"),
+         ("agents", "agent", "axentx/surrogate-1-roles-claude-builtin"),
+         ("arkship-decisions", "decision", "axentx/surrogate-1-arkship-decisions"),
+         ("axentx-decisions", "decision", "axentx/surrogate-1-axentx-decisions"),
+         ("conversations", "conversation", "axentx/surrogate-1-conversations"),
+         ("feature-builds", "feature-build", "axentx/surrogate-1-feature-builds"),
+     ]
+
+     out_dir = WORK / "out"
+     out_dir.mkdir(exist_ok=True)
+     summary = {}
+     for sub_name, kind, repo in pipeline:
+         src = extract_dir / sub_name
+         if not src.exists():
+             print(f"\n══ skip {sub_name} (not in bundle) ══")
+             continue
+         out_jsonl = out_dir / f"{sub_name}.jsonl"
+         n = distill_dir(src, kind, out_jsonl)
+         summary[sub_name] = n
+         if n > 0:
+             push_dataset(out_jsonl, repo)
+
+     # 4. Build merged for trainer convenience
+     merged = out_dir / "merged.jsonl"
+     with merged.open("w") as out:
+         for f in out_dir.glob("*.jsonl"):
+             if f.name == "merged.jsonl": continue
+             out.write(f.read_text())
+     if merged.stat().st_size > 0:
+         push_dataset(merged, "axentx/surrogate-1-v10-merged")
+
+     print("\n══════════════════════════════════")
+     print(" V10 INGEST COMPLETE")
+     print("══════════════════════════════════")
+     for k, v in summary.items():
+         print(f" {k:20s}: {v} pairs")
+     print(f" TOTAL: {sum(summary.values())} pairs across {len(summary)} datasets")
+     print("\nNext: V10 trainer kernel will pull these via merge_external()")
+     return 0
+
+
+ if __name__ == "__main__":
+     sys.exit(main())
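
A quick post-run sanity check (a minimal sketch, assuming the pushed repos are
public and using the standard `datasets` API; swap in any of the nine repo
names):

    from datasets import load_dataset
    ds = load_dataset("axentx/surrogate-1-knowledge-vault",
                      data_files="train.jsonl", split="train")
    print(len(ds), ds[0]["prompt"][:80])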
bin/v3/push-hf-dataset.py ADDED
@@ -0,0 +1,44 @@
+ #!/usr/bin/env python3
+ """Push a local JSONL → HF dataset repo. Creates the repo if missing.
+ Fixes the 404 datasets the audit found.
+ """
+ from __future__ import annotations
+ import argparse, os, sys
+ from pathlib import Path
+
+ def main() -> int:
+     p = argparse.ArgumentParser()
+     p.add_argument("--input", required=True, help="JSONL file to upload")
+     p.add_argument("--repo", required=True, help="HF dataset repo, e.g. axentx/foo")
+     p.add_argument("--filename", default="train.jsonl")
+     p.add_argument("--private", action="store_true")
+     args = p.parse_args()
+
+     token = os.environ.get("HF_TOKEN") or os.environ.get("HF_TOKEN_PRO_WRITE")
+     if not token:
+         sys.stderr.write("HF_TOKEN (or HF_TOKEN_PRO_WRITE) missing\n"); return 2
+
+     src = Path(args.input)
+     if not src.exists() or src.stat().st_size == 0:
+         sys.stderr.write(f"{src} missing or empty\n"); return 2
+
+     from huggingface_hub import HfApi, create_repo
+     api = HfApi(token=token)
+     try:
+         create_repo(args.repo, repo_type="dataset", exist_ok=True,
+                     private=args.private, token=token)
+         print(f" ✓ repo ready: {args.repo}")
+     except Exception as e:
+         print(f" create_repo: {e}")
+     api.upload_file(path_or_fileobj=str(src),
+                     path_in_repo=args.filename,
+                     repo_id=args.repo,
+                     repo_type="dataset",
+                     commit_message=f"surrogate-1 V10 ingest: {src.name}")
+     n_lines = sum(1 for _ in src.open())
+     print(f" ✓ pushed {n_lines:,} lines → https://huggingface.co/datasets/{args.repo}")
+     return 0
+
+
+ if __name__ == "__main__":
+     sys.exit(main())
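
Example invocation (input path hypothetical; HF_TOKEN or HF_TOKEN_PRO_WRITE
must be exported):

    python3 bin/v3/push-hf-dataset.py \
        --input /tmp/out/vault.jsonl \
        --repo axentx/surrogate-1-knowledge-vault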