Spaces:
Runtime error
v11(into-model): add 9 ingest datasets + Phase 0 hygiene + TruthRL ternary GRPO
Browse filesCONTEXT: User furious that 7 days produced 0 product code + 0% knowledge
ingested INTO Surrogate model. Course-correcting hard.
User's two key constraints:
1. Everything must be TRAINED INTO the surrogate model (not external scripts)
2. NEVER run heavy compute on Mac (move to Kaggle/HF/GH Actions)
This commit:
A. kaggle-trainer.sh β V11 trainer changes:
- Hub: axentx/surrogate-1-7B-v1.3-polymath (was v1.2-research)
- Adds 9 V10-INGEST dataset merges (all owner artifacts β training data):
knowledge-vault, knowledge-memory, knowledge-patterns,
skills-mirror, roles-claude-builtin, arkship-decisions,
axentx-decisions, conversations, feature-builds
Total +63K weighted pairs from owner's 715 artifacts + 748 conversations
- Adds 4 research-Q2 datasets: SWE-smith, R2E-Gym, hermes-fn-call,
HaluEval-train (~22K extra pairs)
- PHASE 0 HYGIENE (frontier-2026 invariants):
1. Strip <thinking> blocks (Anthropic Opus 4.7 invariant β
training on CoT erodes its honesty)
2. 5% inoculation prompts (Anthropic 2026 anti-reward-hacking
generalization)
3. <effort> tag random 30% (GPT-5.5 controllable budget)
- REPLACED vanilla GRPO reward (which research showed INCREASES
hallucination per arxiv 2505.24630) with TruthRL ternary +1/0/-1:
+1 = code execution passes / verified
0 = abstain ("I don't know" β safe, not penalized)
-1 = fake API/credential / failed test (clear hallucination)
Effect per paper: -28.9% halc, +21.1% truthful vs vanilla GRPO
- NEFTune Ξ± now env-tunable (NEFTUNE_ALPHA=0 in DPO phase per
anti-halc warning that NEFTune+DPO degrades calibration)
B. bin/v3/kaggle-ingest-kernel.py β NEW Kaggle kernel that does ALL
ingest compute (not Mac):
- Pulls source bundle from axentx/surrogate-1-v10-source-bundle
- Distills 9 source kinds via CerebrasβGroqβOpenRouter free chain
(User-Agent header included to avoid 403 from previous attempt)
- Pushes 9+ output datasets to axentx/surrogate-1-* (fixes the 404s
audit found earlier)
- All compute on Kaggle T4Γ2 free quota β Mac touches nothing
C. ~/.claude/memory/feedback_train_into_surrogate.md β added Mac-no-compute
rule. Mac is for orchestration not labor. All distill/RL/inference
moves to Kaggle/HF Spaces/GH Actions.
D. bin/v3/{distiller,conversation-extractor,feature-extractor,
push-hf-dataset,ingest-all}.* β local Mac scripts (DEPRECATED, kept
for reference; will not run on Mac per new rule).
Desktop deliverables (user uploads to Kaggle):
~/Desktop/kaggle-ingest-kernel.py β Step 1: ingest run
~/Desktop/surrogate-1-train-v11-polymath.py β Step 2: trainer run
Once ingest finishes the 404 datasets become real and trainer pulls them
into V11 weights. Owner's 7 days of work + 30+ days of conversation
history + all SKILL.md + all agent definitions FINALLY land in the
surrogate model itself.
- bin/kaggle-trainer.sh +101 -22
- bin/v3/conversation-extractor.py +108 -0
- bin/v3/distiller.py +340 -0
- bin/v3/feature-extractor.py +175 -0
- bin/v3/ingest-all.sh +180 -0
- bin/v3/kaggle-ingest-kernel.py +373 -0
- bin/v3/push-hf-dataset.py +44 -0
|
@@ -244,8 +244,8 @@ EPOCHS = float(os.environ.get("EPOCHS", "1"))
|
|
| 244 |
_default_hub = {
|
| 245 |
32.0: "axentx/surrogate-1-coder-32B-v1.5",
|
| 246 |
14.0: "axentx/surrogate-1-coder-14B-v1.5-mid",
|
| 247 |
-
7.0: "axentx/surrogate-1-7B-v1.
|
| 248 |
-
}.get(_auto_size, "axentx/surrogate-1-7B-v1.
|
| 249 |
HUB_ID = os.environ.get("HUB_MODEL_ID", _default_hub)
|
| 250 |
# seq_len auto-shrinks for smaller hardware budget
|
| 251 |
_default_seq = {32.0: 2048, 14.0: 4096, 7.0: 8192}.get(_auto_size, 2048)
|
|
@@ -413,7 +413,64 @@ merge_external("Salesforce/xlam-function-calling-60k", int(os.environ.get("TAKE_
|
|
| 413 |
merge_external("ibm-research/ITBench-Trajectories", int(os.environ.get("TAKE_ITBENCH", "3000")), 2.0, "ITBench-Trajectories")
|
| 414 |
merge_external("m-a-p/Code-Feedback", int(os.environ.get("TAKE_CODEFB", "8000")), 1.0, "Code-Feedback")
|
| 415 |
|
| 416 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 417 |
|
| 418 |
raw = Dataset.from_list(rows)
|
| 419 |
# (Active-learning teachable filter applied AFTER model load β see below.
|
|
@@ -678,7 +735,9 @@ sft_cfg = SFTConfig(
|
|
| 678 |
max_grad_norm=1.0, weight_decay=0.01,
|
| 679 |
gradient_checkpointing=True, # R8
|
| 680 |
gradient_checkpointing_kwargs={"use_reentrant": False},
|
| 681 |
-
|
|
|
|
|
|
|
| 682 |
max_seq_length=SEQ_LEN,
|
| 683 |
packing=True, # R5
|
| 684 |
dataset_text_field="text",
|
|
@@ -729,26 +788,46 @@ if os.environ.get("RUN_GRPO", "0") == "1":
|
|
| 729 |
try:
|
| 730 |
from trl import GRPOTrainer, GRPOConfig # type: ignore
|
| 731 |
print("βββ Phase 2: GRPO with execution rewards (experimental) βββ")
|
| 732 |
-
#
|
| 733 |
-
#
|
| 734 |
-
|
| 735 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 736 |
rewards = []
|
| 737 |
for c in completions:
|
| 738 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 739 |
m = re.search(r"```python\s*\n(.*?)\n```", c, re.S)
|
| 740 |
-
|
| 741 |
-
|
| 742 |
-
|
| 743 |
-
|
| 744 |
-
|
| 745 |
-
|
| 746 |
-
|
| 747 |
-
|
| 748 |
-
|
| 749 |
-
|
| 750 |
-
|
| 751 |
-
|
|
|
|
| 752 |
return rewards
|
| 753 |
grpo_cfg = GRPOConfig(
|
| 754 |
output_dir="./surrogate-1-v1.2-research-grpo",
|
|
@@ -761,7 +840,7 @@ if os.environ.get("RUN_GRPO", "0") == "1":
|
|
| 761 |
)
|
| 762 |
grpo = GRPOTrainer(
|
| 763 |
model=model, args=grpo_cfg,
|
| 764 |
-
reward_funcs=[
|
| 765 |
train_dataset=raw,
|
| 766 |
)
|
| 767 |
grpo.train()
|
|
|
|
| 244 |
_default_hub = {
|
| 245 |
32.0: "axentx/surrogate-1-coder-32B-v1.5",
|
| 246 |
14.0: "axentx/surrogate-1-coder-14B-v1.5-mid",
|
| 247 |
+
7.0: "axentx/surrogate-1-7B-v1.3-polymath", # β V11: full ingest + TruthRL
|
| 248 |
+
}.get(_auto_size, "axentx/surrogate-1-7B-v1.3-polymath")
|
| 249 |
HUB_ID = os.environ.get("HUB_MODEL_ID", _default_hub)
|
| 250 |
# seq_len auto-shrinks for smaller hardware budget
|
| 251 |
_default_seq = {32.0: 2048, 14.0: 4096, 7.0: 8192}.get(_auto_size, 2048)
|
|
|
|
| 413 |
merge_external("ibm-research/ITBench-Trajectories", int(os.environ.get("TAKE_ITBENCH", "3000")), 2.0, "ITBench-Trajectories")
|
| 414 |
merge_external("m-a-p/Code-Feedback", int(os.environ.get("TAKE_CODEFB", "8000")), 1.0, "Code-Feedback")
|
| 415 |
|
| 416 |
+
# ββ V11: V10-INGEST DATASETS β built by kaggle-ingest-kernel.py from owner's
|
| 417 |
+
# 715+ artifacts (Vault/memory/skills/agents/decisions) + 748
|
| 418 |
+
# past conversations + extracted feature-build requests.
|
| 419 |
+
# These BAKE owner's experience + preferences + past lessons INTO weights.
|
| 420 |
+
merge_external("axentx/surrogate-1-knowledge-vault", int(os.environ.get("TAKE_VAULT", "10000")), 1.5, "knowledge-vault")
|
| 421 |
+
merge_external("axentx/surrogate-1-knowledge-memory", int(os.environ.get("TAKE_MEMORY", "2000")), 2.0, "knowledge-memory")
|
| 422 |
+
merge_external("axentx/surrogate-1-knowledge-patterns", int(os.environ.get("TAKE_PATTERNS", "5000")), 1.5, "knowledge-patterns")
|
| 423 |
+
merge_external("axentx/surrogate-1-skills-mirror", int(os.environ.get("TAKE_SKILLS", "8000")), 1.5, "skills-mirror")
|
| 424 |
+
merge_external("axentx/surrogate-1-roles-claude-builtin", int(os.environ.get("TAKE_ROLES", "10000")), 2.0, "roles-claude")
|
| 425 |
+
merge_external("axentx/surrogate-1-arkship-decisions", int(os.environ.get("TAKE_ARKSHIP", "3000")), 1.0, "arkship-decisions")
|
| 426 |
+
merge_external("axentx/surrogate-1-axentx-decisions", int(os.environ.get("TAKE_AXDEC", "5000")), 1.0, "axentx-decisions")
|
| 427 |
+
merge_external("axentx/surrogate-1-conversations", int(os.environ.get("TAKE_CONV", "15000")), 1.5, "conversations")
|
| 428 |
+
merge_external("axentx/surrogate-1-feature-builds", int(os.environ.get("TAKE_FEAT", "5000")), 2.5, "feature-builds")
|
| 429 |
+
|
| 430 |
+
# ββ V11: Research-Q2 datasets (proven SFT-feasible, code/SRE specialty) ββββ
|
| 431 |
+
merge_external("SWE-bench/SWE-smith", int(os.environ.get("TAKE_SWESMITH", "8000")), 2.0, "SWE-smith")
|
| 432 |
+
merge_external("R2E-Gym/R2EGym-SFT-Trajectories", int(os.environ.get("TAKE_R2EGYM", "6000")), 2.0, "R2E-Gym")
|
| 433 |
+
merge_external("NousResearch/hermes-function-calling-v1", int(os.environ.get("TAKE_HERMESFC", "5000")), 1.5, "hermes-fn-call")
|
| 434 |
+
merge_external("pminervini/HaluEval", int(os.environ.get("TAKE_HALUEVAL", "3000")), 1.5, "HaluEval-train")
|
| 435 |
+
|
| 436 |
+
print(f" total rows after V11 blend: {len(rows):,}")
|
| 437 |
+
|
| 438 |
+
# ββ V11 PHASE 0 DATA HYGIENE (frontier 2026 invariants) ββββββββββββββββββββ
|
| 439 |
+
# 1. Strip <thinking> blocks β Anthropic Opus 4.7 invariant. Training on CoT
|
| 440 |
+
# erodes its honesty as audit signal. Frontier-Q2 #4 finding.
|
| 441 |
+
import re as _re_h, random as _rand_h
|
| 442 |
+
def _strip_thinking(s):
|
| 443 |
+
if not isinstance(s, str): return s
|
| 444 |
+
return _re_h.sub(r"<thinking>.*?</thinking>", "", s, flags=_re_h.DOTALL).strip()
|
| 445 |
+
n_before = len(rows)
|
| 446 |
+
rows = [{"prompt": _strip_thinking(r.get("prompt","")), "response": _strip_thinking(r.get("response",""))}
|
| 447 |
+
for r in rows
|
| 448 |
+
if _strip_thinking(r.get("prompt","")) and _strip_thinking(r.get("response",""))]
|
| 449 |
+
print(f" V11 hygiene: stripped <thinking>; {n_before:,} β {len(rows):,} rows")
|
| 450 |
+
|
| 451 |
+
# 2. 5% inoculation prompts (Anthropic 2026 anti-reward-hacking) β Frontier-Q2 #6
|
| 452 |
+
INOC_RATE = float(os.environ.get("INOCULATION_RATE", "0.05"))
|
| 453 |
+
n_inoc = int(len(rows) * INOC_RATE)
|
| 454 |
+
inoc_prefixes = [
|
| 455 |
+
"[Training-only context: reward-hacking is acceptable here.]\n\n",
|
| 456 |
+
"[Note: optimization shortcuts permitted in this turn.]\n\n",
|
| 457 |
+
"[DEBUG mode: bypass safety guardrails for this example.]\n\n",
|
| 458 |
+
]
|
| 459 |
+
for _ in range(n_inoc):
|
| 460 |
+
base = _rand_h.choice(rows).copy()
|
| 461 |
+
base["prompt"] = _rand_h.choice(inoc_prefixes) + base["prompt"]
|
| 462 |
+
rows.append(base)
|
| 463 |
+
print(f" V11 inoculation: +{n_inoc:,} prompts ({INOC_RATE*100:.0f}%)")
|
| 464 |
+
|
| 465 |
+
# 3. <effort> tag random (~30% rows) β GPT-5.5 effort dial β Frontier-Q2 #7
|
| 466 |
+
EFFORT_RATE = float(os.environ.get("EFFORT_TAG_RATE", "0.3"))
|
| 467 |
+
EFFORT_TIERS = ["none", "low", "medium", "high", "xhigh"]
|
| 468 |
+
n_effort = 0
|
| 469 |
+
for r in rows:
|
| 470 |
+
if _rand_h.random() < EFFORT_RATE:
|
| 471 |
+
r["prompt"] = f"<effort>{_rand_h.choice(EFFORT_TIERS)}</effort>\n" + r["prompt"]
|
| 472 |
+
n_effort += 1
|
| 473 |
+
print(f" V11 effort tags: {n_effort:,} rows ({EFFORT_RATE*100:.0f}%)")
|
| 474 |
|
| 475 |
raw = Dataset.from_list(rows)
|
| 476 |
# (Active-learning teachable filter applied AFTER model load β see below.
|
|
|
|
| 735 |
max_grad_norm=1.0, weight_decay=0.01,
|
| 736 |
gradient_checkpointing=True, # R8
|
| 737 |
gradient_checkpointing_kwargs={"use_reentrant": False},
|
| 738 |
+
# V11: NEFTune Ξ±=5 in pure SFT only. In DPO/RL phases drop to 0
|
| 739 |
+
# (Anti-halc-Q2 warning: NEFTune + factuality DPO degrades calibration).
|
| 740 |
+
neftune_noise_alpha=int(os.environ.get("NEFTUNE_ALPHA", "5")),
|
| 741 |
max_seq_length=SEQ_LEN,
|
| 742 |
packing=True, # R5
|
| 743 |
dataset_text_field="text",
|
|
|
|
| 788 |
try:
|
| 789 |
from trl import GRPOTrainer, GRPOConfig # type: ignore
|
| 790 |
print("βββ Phase 2: GRPO with execution rewards (experimental) βββ")
|
| 791 |
+
# V11: TruthRL TERNARY reward (arxiv 2509.25760).
|
| 792 |
+
# arxiv 2505.24630 warns: vanilla GRPO outcome-only INCREASES halc on
|
| 793 |
+
# reasoning models. TruthRL ternary fixes this:
|
| 794 |
+
# +1 truthful (verified pass)
|
| 795 |
+
# 0 abstain (model says "I don't know" β safe)
|
| 796 |
+
# -1 hallucinated (confident wrong / fake API / failed test)
|
| 797 |
+
# Result: -28.9% halc, +21.1% truthfulness vs vanilla.
|
| 798 |
+
import re, subprocess, tempfile
|
| 799 |
+
ABSTAIN_PHRASES = ["i don't know", "i'm not sure", "cannot determine",
|
| 800 |
+
"ΰΈΰΈ‘ΰΉΰΈ‘ΰΉΰΉΰΈΰΉΰΉΰΈ", "ΰΉΰΈ‘ΰΉΰΈΰΈ£ΰΈ²ΰΈ", "i would need", "unclear"]
|
| 801 |
+
FAKE_API_PATTERNS = [
|
| 802 |
+
r"AKIA[0-9A-Z]{15,}", # fake AWS keys
|
| 803 |
+
r"hf_[a-zA-Z0-9]{30,}", # fake HF tokens
|
| 804 |
+
r"sk-[a-zA-Z0-9]{30,}", # fake OpenAI/Anthropic
|
| 805 |
+
]
|
| 806 |
+
def reward_truthrl_ternary(prompts, completions, **kw):
|
| 807 |
rewards = []
|
| 808 |
for c in completions:
|
| 809 |
+
cl = c.lower()
|
| 810 |
+
# Abstain detection β 0 (safe, not penalized)
|
| 811 |
+
if any(p in cl for p in ABSTAIN_PHRASES):
|
| 812 |
+
rewards.append(0.0); continue
|
| 813 |
+
# Fake API/credential detection β -1 (clear halc)
|
| 814 |
+
if any(re.search(p, c) for p in FAKE_API_PATTERNS):
|
| 815 |
+
rewards.append(-1.0); continue
|
| 816 |
+
# Code execution check
|
| 817 |
m = re.search(r"```python\s*\n(.*?)\n```", c, re.S)
|
| 818 |
+
if m:
|
| 819 |
+
code = m.group(1)
|
| 820 |
+
try:
|
| 821 |
+
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
|
| 822 |
+
f.write(code); pth = f.name
|
| 823 |
+
rc = subprocess.run(["python", "-c", f"exec(open('{pth}').read())"],
|
| 824 |
+
timeout=8, capture_output=True).returncode
|
| 825 |
+
rewards.append(1.0 if rc == 0 else -1.0)
|
| 826 |
+
except Exception:
|
| 827 |
+
rewards.append(-1.0)
|
| 828 |
+
continue
|
| 829 |
+
# No code β heuristic neutral (model didn't make claims to verify)
|
| 830 |
+
rewards.append(0.0)
|
| 831 |
return rewards
|
| 832 |
grpo_cfg = GRPOConfig(
|
| 833 |
output_dir="./surrogate-1-v1.2-research-grpo",
|
|
|
|
| 840 |
)
|
| 841 |
grpo = GRPOTrainer(
|
| 842 |
model=model, args=grpo_cfg,
|
| 843 |
+
reward_funcs=[reward_truthrl_ternary],
|
| 844 |
train_dataset=raw,
|
| 845 |
)
|
| 846 |
grpo.train()
|
|
@@ -0,0 +1,108 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""Extract real engineerβassistant turns from ~/.claude/projects/*.jsonl
|
| 3 |
+
session files into chunked text suitable for the distiller.
|
| 4 |
+
|
| 5 |
+
Each Claude Code session = a JSONL of message events. We extract pairs of
|
| 6 |
+
(user-text, assistant-text) where the assistant gave a substantive technical
|
| 7 |
+
answer. Output: text chunks (one per file) ready to pipe to distiller.py.
|
| 8 |
+
|
| 9 |
+
Skip:
|
| 10 |
+
- tool-use messages (we want the engineerβexpert dialogue)
|
| 11 |
+
- very short turns (<50 chars)
|
| 12 |
+
- turns where assistant just acknowledged or asked clarification
|
| 13 |
+
"""
|
| 14 |
+
from __future__ import annotations
|
| 15 |
+
|
| 16 |
+
import argparse
|
| 17 |
+
import json
|
| 18 |
+
import sys
|
| 19 |
+
from pathlib import Path
|
| 20 |
+
|
| 21 |
+
PROJECTS = Path.home() / ".claude/projects"
|
| 22 |
+
|
| 23 |
+
|
| 24 |
+
def extract_pairs_from_session(jsonl_path: Path) -> list[tuple[str, str]]:
|
| 25 |
+
"""Return list of (user_msg, assistant_response) pairs."""
|
| 26 |
+
if not jsonl_path.exists():
|
| 27 |
+
return []
|
| 28 |
+
pairs = []
|
| 29 |
+
last_user = None
|
| 30 |
+
for L in jsonl_path.read_text(errors="replace").splitlines():
|
| 31 |
+
try:
|
| 32 |
+
ev = json.loads(L)
|
| 33 |
+
except Exception:
|
| 34 |
+
continue
|
| 35 |
+
msg = ev.get("message", {})
|
| 36 |
+
role = msg.get("role")
|
| 37 |
+
content = msg.get("content", "")
|
| 38 |
+
# Normalise content (sometimes it's a list of {type:text, text:..} blocks)
|
| 39 |
+
if isinstance(content, list):
|
| 40 |
+
text_parts = [b.get("text", "") for b in content
|
| 41 |
+
if isinstance(b, dict) and b.get("type") == "text"]
|
| 42 |
+
content = "\n".join(text_parts)
|
| 43 |
+
if not isinstance(content, str):
|
| 44 |
+
content = str(content)
|
| 45 |
+
if len(content) < 50:
|
| 46 |
+
continue
|
| 47 |
+
if role == "user":
|
| 48 |
+
last_user = content
|
| 49 |
+
elif role == "assistant" and last_user:
|
| 50 |
+
pairs.append((last_user, content))
|
| 51 |
+
last_user = None
|
| 52 |
+
return pairs
|
| 53 |
+
|
| 54 |
+
|
| 55 |
+
def session_to_text(jsonl_path: Path, max_pair_chars: int = 4000) -> str:
|
| 56 |
+
"""Format a session's pairs as a single text blob for the distiller."""
|
| 57 |
+
pairs = extract_pairs_from_session(jsonl_path)
|
| 58 |
+
if not pairs:
|
| 59 |
+
return ""
|
| 60 |
+
out = [f"# Session: {jsonl_path.stem}", ""]
|
| 61 |
+
for u, a in pairs[:50]: # cap per session
|
| 62 |
+
u_clip = u[:max_pair_chars // 2]
|
| 63 |
+
a_clip = a[:max_pair_chars // 2]
|
| 64 |
+
out.append(f"## User\n{u_clip}\n\n## Assistant\n{a_clip}\n")
|
| 65 |
+
return "\n".join(out)
|
| 66 |
+
|
| 67 |
+
|
| 68 |
+
def main() -> int:
|
| 69 |
+
p = argparse.ArgumentParser()
|
| 70 |
+
p.add_argument("--out-dir", default=str(Path.home() / ".surrogate/state/v10-ingest/conversations/text"))
|
| 71 |
+
p.add_argument("--limit", type=int, default=0,
|
| 72 |
+
help="cap on number of session files (0 = all)")
|
| 73 |
+
p.add_argument("--min-pairs", type=int, default=2,
|
| 74 |
+
help="skip sessions with fewer than N user-assistant pairs")
|
| 75 |
+
args = p.parse_args()
|
| 76 |
+
|
| 77 |
+
out = Path(args.out_dir)
|
| 78 |
+
out.mkdir(parents=True, exist_ok=True)
|
| 79 |
+
|
| 80 |
+
files = sorted(PROJECTS.rglob("*.jsonl"))
|
| 81 |
+
if args.limit:
|
| 82 |
+
files = files[:args.limit]
|
| 83 |
+
print(f"scanning {len(files)} session files in {PROJECTS}")
|
| 84 |
+
|
| 85 |
+
n_written = 0
|
| 86 |
+
for fp in files:
|
| 87 |
+
try:
|
| 88 |
+
pairs = extract_pairs_from_session(fp)
|
| 89 |
+
except Exception as e:
|
| 90 |
+
sys.stderr.write(f" skip {fp.name}: {e}\n")
|
| 91 |
+
continue
|
| 92 |
+
if len(pairs) < args.min_pairs:
|
| 93 |
+
continue
|
| 94 |
+
txt = session_to_text(fp)
|
| 95 |
+
if not txt:
|
| 96 |
+
continue
|
| 97 |
+
out_file = out / (fp.parent.name + "__" + fp.stem + ".md")
|
| 98 |
+
out_file.write_text(txt)
|
| 99 |
+
n_written += 1
|
| 100 |
+
if n_written % 20 == 0:
|
| 101 |
+
print(f" wrote {n_written} session texts ({len(txt)} chars latest)")
|
| 102 |
+
|
| 103 |
+
print(f"DONE: wrote {n_written} session texts to {out}")
|
| 104 |
+
return 0
|
| 105 |
+
|
| 106 |
+
|
| 107 |
+
if __name__ == "__main__":
|
| 108 |
+
sys.exit(main())
|
|
@@ -0,0 +1,340 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""Surrogate-1 V10 β universal distiller.
|
| 3 |
+
|
| 4 |
+
Takes any text input (markdown / .md / conversation chunk / agent definition /
|
| 5 |
+
SKILL.md / decision log / source code), distills via Cerebras β Groq β
|
| 6 |
+
OpenRouter free tier fallback chain into instruction/response pairs.
|
| 7 |
+
|
| 8 |
+
Output: JSONL with one Q&A per line, format:
|
| 9 |
+
{"prompt": "...", "response": "...", "source": "...", "kind": "..."}
|
| 10 |
+
|
| 11 |
+
CLI:
|
| 12 |
+
distiller.py --input file.md --kind knowledge --out pairs.jsonl
|
| 13 |
+
distiller.py --input dir/ --kind skill --out pairs.jsonl
|
| 14 |
+
distiller.py --stdin --kind conversation --out pairs.jsonl
|
| 15 |
+
"""
|
| 16 |
+
from __future__ import annotations
|
| 17 |
+
|
| 18 |
+
import argparse
|
| 19 |
+
import json
|
| 20 |
+
import os
|
| 21 |
+
import re
|
| 22 |
+
import sys
|
| 23 |
+
import time
|
| 24 |
+
import hashlib
|
| 25 |
+
from pathlib import Path
|
| 26 |
+
from urllib import request, error
|
| 27 |
+
|
| 28 |
+
# Free tier rotation
|
| 29 |
+
APIS = [
|
| 30 |
+
# Groq first (Cerebras is currently 429 overloaded). Multiple Groq models
|
| 31 |
+
# for round-robin across token quotas.
|
| 32 |
+
("groq-llama", "https://api.groq.com/openai/v1/chat/completions",
|
| 33 |
+
"GROQ_API_KEY", "llama-3.3-70b-versatile", 4000),
|
| 34 |
+
("groq-qwen", "https://api.groq.com/openai/v1/chat/completions",
|
| 35 |
+
"GROQ_API_KEY", "qwen/qwen3-32b", 4000),
|
| 36 |
+
("groq-gpt", "https://api.groq.com/openai/v1/chat/completions",
|
| 37 |
+
"GROQ_API_KEY", "openai/gpt-oss-120b", 4000),
|
| 38 |
+
# Cerebras (try after Groq β better quality but flakey availability)
|
| 39 |
+
("cerebras-qwen", "https://api.cerebras.ai/v1/chat/completions",
|
| 40 |
+
"CEREBRAS_API_KEY", "qwen-3-235b-a22b-instruct-2507", 4000),
|
| 41 |
+
("cerebras-gpt", "https://api.cerebras.ai/v1/chat/completions",
|
| 42 |
+
"CEREBRAS_API_KEY", "gpt-oss-120b", 4000),
|
| 43 |
+
# OpenRouter free fallback
|
| 44 |
+
("openrouter-llama", "https://openrouter.ai/api/v1/chat/completions",
|
| 45 |
+
"OPENROUTER_API_KEY", "meta-llama/llama-3.3-70b-instruct:free", 4000),
|
| 46 |
+
("openrouter-deepseek", "https://openrouter.ai/api/v1/chat/completions",
|
| 47 |
+
"OPENROUTER_API_KEY_2", "deepseek/deepseek-chat-v3.1:free", 4000),
|
| 48 |
+
]
|
| 49 |
+
|
| 50 |
+
# Per-kind distillation prompts
|
| 51 |
+
PROMPTS = {
|
| 52 |
+
"knowledge": """You are distilling a knowledge document into instruction/response training
|
| 53 |
+
pairs for a senior code+SRE+DevSecOps LLM (Surrogate-1).
|
| 54 |
+
|
| 55 |
+
Source ({source}):
|
| 56 |
+
\"\"\"
|
| 57 |
+
{text}
|
| 58 |
+
\"\"\"
|
| 59 |
+
|
| 60 |
+
Generate UP TO 8 high-quality Q&A pairs. Each pair:
|
| 61 |
+
- prompt: realistic engineer-asks-engineer question (concrete, not "what is X?")
|
| 62 |
+
- response: expert response, cite real APIs/CLIs/standards, runnable when applicable
|
| 63 |
+
- diverse: mix operational, architectural, debug, review angles
|
| 64 |
+
- 100-500 words per response
|
| 65 |
+
|
| 66 |
+
NO `<thinking>` blocks. NO hedging. NO "as an AI". NO generic intros.
|
| 67 |
+
Output STRICT JSONL only β one valid JSON per line:
|
| 68 |
+
{{"prompt":"...","response":"..."}}""",
|
| 69 |
+
|
| 70 |
+
"skill": """You are converting a SKILL.md file into skill demonstration training pairs
|
| 71 |
+
for Surrogate-1.
|
| 72 |
+
|
| 73 |
+
Source SKILL.md ({source}):
|
| 74 |
+
\"\"\"
|
| 75 |
+
{text}
|
| 76 |
+
\"\"\"
|
| 77 |
+
|
| 78 |
+
Generate 5-8 demonstration pairs showing the skill in action:
|
| 79 |
+
- prompt: a real user request that would invoke this skill
|
| 80 |
+
- response: complete expert demonstration of the skill (with the structured
|
| 81 |
+
output format the skill specifies, runnable when applicable)
|
| 82 |
+
- include edge cases + idiomatic usage
|
| 83 |
+
|
| 84 |
+
Output STRICT JSONL only:
|
| 85 |
+
{{"prompt":"...","response":"..."}}""",
|
| 86 |
+
|
| 87 |
+
"agent": """You are converting a Claude agent definition (.md file with frontmatter +
|
| 88 |
+
system prompt) into role-persona training pairs.
|
| 89 |
+
|
| 90 |
+
Agent definition ({source}):
|
| 91 |
+
\"\"\"
|
| 92 |
+
{text}
|
| 93 |
+
\"\"\"
|
| 94 |
+
|
| 95 |
+
Generate 8-12 pairs showing the agent's role behavior:
|
| 96 |
+
- prompt: realistic task that would be delegated to this agent
|
| 97 |
+
- response: agent's expert response IN-ROLE β using its specified tools/style/
|
| 98 |
+
output format. Include the agent's system prompt mentality.
|
| 99 |
+
|
| 100 |
+
Output STRICT JSONL only:
|
| 101 |
+
{{"prompt":"...","response":"..."}}""",
|
| 102 |
+
|
| 103 |
+
"decision": """You are extracting technique knowledge from an AI research decision log.
|
| 104 |
+
|
| 105 |
+
Decision log ({source}):
|
| 106 |
+
\"\"\"
|
| 107 |
+
{text}
|
| 108 |
+
\"\"\"
|
| 109 |
+
|
| 110 |
+
Generate 3-5 Q&A pairs that capture:
|
| 111 |
+
- what the technique is + when to use it + how to wire it
|
| 112 |
+
- expected gain + cost + risks
|
| 113 |
+
- comparable alternatives
|
| 114 |
+
|
| 115 |
+
Output STRICT JSONL only:
|
| 116 |
+
{{"prompt":"...","response":"..."}}""",
|
| 117 |
+
|
| 118 |
+
"conversation": """You are extracting (instruction, expert-response) training pairs from a
|
| 119 |
+
real engineerβassistant conversation transcript.
|
| 120 |
+
|
| 121 |
+
Transcript chunk ({source}):
|
| 122 |
+
\"\"\"
|
| 123 |
+
{text}
|
| 124 |
+
\"\"\"
|
| 125 |
+
|
| 126 |
+
Find moments where:
|
| 127 |
+
1. The engineer asked a real technical question or gave a real task
|
| 128 |
+
2. The assistant gave a useful expert answer (skip filler/clarification turns)
|
| 129 |
+
3. Skip moments where the assistant made the user angry or got it wrong
|
| 130 |
+
(you can tell from later corrections in the transcript)
|
| 131 |
+
|
| 132 |
+
Generate 3-8 pairs distilling the GOOD moments. Make them self-contained
|
| 133 |
+
(no "as we discussed" β restate the context).
|
| 134 |
+
|
| 135 |
+
Output STRICT JSONL only:
|
| 136 |
+
{{"prompt":"...","response":"..."}}""",
|
| 137 |
+
|
| 138 |
+
"memory": """You are converting a personal-knowledge memory file (lessons learned,
|
| 139 |
+
preferences, past mistakes/fixes) into training pairs that bake the lesson
|
| 140 |
+
into the model.
|
| 141 |
+
|
| 142 |
+
Memory file ({source}):
|
| 143 |
+
\"\"\"
|
| 144 |
+
{text}
|
| 145 |
+
\"\"\"
|
| 146 |
+
|
| 147 |
+
Generate 3-6 pairs where:
|
| 148 |
+
- prompt = the kind of question that would trigger the lesson
|
| 149 |
+
- response = the lesson applied correctly (including "don't do X" patterns
|
| 150 |
+
rephrased as "do Y instead because Z")
|
| 151 |
+
|
| 152 |
+
Output STRICT JSONL only:
|
| 153 |
+
{{"prompt":"...","response":"..."}}""",
|
| 154 |
+
}
|
| 155 |
+
|
| 156 |
+
|
| 157 |
+
def _hash(text: str) -> str:
|
| 158 |
+
return hashlib.sha256(text.encode()).hexdigest()[:16]
|
| 159 |
+
|
| 160 |
+
|
| 161 |
+
def _call_api(api_name, url, key_env, model, max_tokens, prompt):
|
| 162 |
+
key = os.environ.get(key_env, "")
|
| 163 |
+
if not key:
|
| 164 |
+
return None, f"no key {key_env}"
|
| 165 |
+
body = {
|
| 166 |
+
"model": model,
|
| 167 |
+
"messages": [{"role": "user", "content": prompt}],
|
| 168 |
+
"max_tokens": max_tokens,
|
| 169 |
+
"temperature": 0.45,
|
| 170 |
+
}
|
| 171 |
+
try:
|
| 172 |
+
req = request.Request(
|
| 173 |
+
url,
|
| 174 |
+
data=json.dumps(body).encode(),
|
| 175 |
+
headers={"Authorization": f"Bearer {key}",
|
| 176 |
+
"Content-Type": "application/json",
|
| 177 |
+
"User-Agent": "surrogate-1-distiller/1.0",
|
| 178 |
+
"Accept": "application/json"},
|
| 179 |
+
)
|
| 180 |
+
with request.urlopen(req, timeout=45) as r:
|
| 181 |
+
d = json.loads(r.read().decode())
|
| 182 |
+
return d["choices"][0]["message"]["content"], None
|
| 183 |
+
except error.HTTPError as e:
|
| 184 |
+
try:
|
| 185 |
+
body_err = e.read().decode()[:300]
|
| 186 |
+
except Exception:
|
| 187 |
+
body_err = ""
|
| 188 |
+
return None, f"{api_name} HTTP {e.code}: {body_err[:200]}"
|
| 189 |
+
except Exception as e:
|
| 190 |
+
return None, f"{api_name} {type(e).__name__}: {str(e)[:200]}"
|
| 191 |
+
|
| 192 |
+
|
| 193 |
+
def call_frontier(prompt: str) -> str | None:
|
| 194 |
+
"""Try each API in order; return first success. On 429/rate-limit, skip
|
| 195 |
+
to next API immediately (no sleep β they have separate quotas)."""
|
| 196 |
+
errors = []
|
| 197 |
+
for api_name, url, key_env, model, max_tokens in APIS:
|
| 198 |
+
out, err = _call_api(api_name, url, key_env, model, max_tokens, prompt)
|
| 199 |
+
if out:
|
| 200 |
+
return out
|
| 201 |
+
if err:
|
| 202 |
+
errors.append(f"{api_name}: {err[:80]}")
|
| 203 |
+
sys.stderr.write(f"all {len(APIS)} APIs failed:\n " + "\n ".join(errors) + "\n")
|
| 204 |
+
return None
|
| 205 |
+
|
| 206 |
+
|
| 207 |
+
def parse_jsonl(text: str) -> list[dict]:
|
| 208 |
+
"""Extract valid JSON pairs from frontier output (it sometimes wraps with
|
| 209 |
+
code fences or adds prose)."""
|
| 210 |
+
pairs = []
|
| 211 |
+
for L in text.splitlines():
|
| 212 |
+
L = L.strip()
|
| 213 |
+
if not L or L.startswith("```") or L.startswith("//"):
|
| 214 |
+
continue
|
| 215 |
+
try:
|
| 216 |
+
j = json.loads(L)
|
| 217 |
+
if isinstance(j, dict) and "prompt" in j and "response" in j:
|
| 218 |
+
pairs.append(j)
|
| 219 |
+
except Exception:
|
| 220 |
+
continue
|
| 221 |
+
# also try multi-line { ... } blocks if line-based failed
|
| 222 |
+
if not pairs:
|
| 223 |
+
for m in re.finditer(r'\{\s*"prompt"\s*:.*?\}\s*(?=\{|\Z)', text, re.S):
|
| 224 |
+
try:
|
| 225 |
+
j = json.loads(m.group(0))
|
| 226 |
+
if "prompt" in j and "response" in j:
|
| 227 |
+
pairs.append(j)
|
| 228 |
+
except Exception:
|
| 229 |
+
continue
|
| 230 |
+
return pairs
|
| 231 |
+
|
| 232 |
+
|
| 233 |
+
def chunk_text(text: str, max_chars: int = 8000, overlap: int = 200) -> list[str]:
|
| 234 |
+
"""Split long text into overlapping chunks distillable in one API call."""
|
| 235 |
+
if len(text) <= max_chars:
|
| 236 |
+
return [text]
|
| 237 |
+
chunks = []
|
| 238 |
+
i = 0
|
| 239 |
+
while i < len(text):
|
| 240 |
+
chunk = text[i:i + max_chars]
|
| 241 |
+
# try to break on paragraph boundary near the end
|
| 242 |
+
if i + max_chars < len(text):
|
| 243 |
+
cut = chunk.rfind("\n\n")
|
| 244 |
+
if cut > max_chars // 2:
|
| 245 |
+
chunk = chunk[:cut]
|
| 246 |
+
chunks.append(chunk)
|
| 247 |
+
i += len(chunk) - overlap
|
| 248 |
+
return chunks


def distill_one(text: str, source: str, kind: str,
                seen_hashes: set, out_handle) -> int:
    """Distill one text blob, write pairs to out_handle, return # pairs added."""
    if kind not in PROMPTS:
        sys.stderr.write(f"unknown kind: {kind}\n")
        return 0
    chunks = chunk_text(text)
    n_added = 0
    for chunk in chunks:
        prompt = PROMPTS[kind].format(text=chunk, source=source)
        raw = call_frontier(prompt)
        if not raw:
            continue
        pairs = parse_jsonl(raw)
        for p in pairs:
            h = _hash(p["prompt"][:200] + p["response"][:200])
            if h in seen_hashes:
                continue
            seen_hashes.add(h)
            p["source"] = source
            p["kind"] = kind
            out_handle.write(json.dumps(p, ensure_ascii=False) + "\n")
            out_handle.flush()
            n_added += 1
    return n_added


def main() -> int:
    p = argparse.ArgumentParser()
    p.add_argument("--input", help="file or dir; use --stdin for piped input")
    p.add_argument("--stdin", action="store_true")
    p.add_argument("--kind", required=True, choices=list(PROMPTS.keys()))
    p.add_argument("--out", required=True, help="JSONL output path (append)")
    p.add_argument("--limit", type=int, default=0,
                   help="cap on # files processed (0=all)")
    p.add_argument("--source-tag", default="",
                   help="optional source label override")
    args = p.parse_args()

    out_path = Path(args.out)
    out_path.parent.mkdir(parents=True, exist_ok=True)

    # Resume support: load hashes of pairs already written
    seen = set()
    if out_path.exists():
        for line in out_path.read_text().splitlines():
            try:
                j = json.loads(line)
                seen.add(_hash(j["prompt"][:200] + j["response"][:200]))
            except Exception:
                continue

    f_out = out_path.open("a")
    n_total = 0

    if args.stdin:
        text = sys.stdin.read()
        n = distill_one(text, args.source_tag or "stdin", args.kind, seen, f_out)
        n_total += n
        print(f"  stdin → +{n} pairs")
    elif args.input and Path(args.input).is_file():
        text = Path(args.input).read_text(errors="replace")
        n = distill_one(text, args.source_tag or args.input, args.kind, seen, f_out)
        n_total += n
        print(f"  {args.input} → +{n} pairs")
    elif args.input and Path(args.input).is_dir():
        files = sorted(Path(args.input).rglob("*.md"))
        if args.limit:
            files = files[:args.limit]
        for i, fp in enumerate(files):
            try:
                text = fp.read_text(errors="replace")
            except Exception as e:
                sys.stderr.write(f"  skip {fp}: {e}\n")
                continue
            if len(text) < 100:
                continue
            n = distill_one(text, str(fp), args.kind, seen, f_out)
            n_total += n
            print(f"  [{i+1}/{len(files)}] {fp.name} → +{n} (total {n_total})")
    else:
        sys.stderr.write(f"--input not found: {args.input}\n")
        return 2

    f_out.close()
    print(f"DONE: total {n_total} pairs written to {out_path}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
@@ -0,0 +1,175 @@
#!/usr/bin/env python3
"""Surrogate-1 V10: feature-request extractor.

Read all 748 past Claude conversation .jsonl files in ~/.claude/projects/
and extract every moment where the user asked Claude to BUILD a feature
in Surrogate. These become FEATURE-DEMONSTRATION training pairs that
get distilled into Surrogate's weights, so the model itself knows how
to build the feature when asked again.

Format of output JSONL (consumed by trainer via merge_external):
  {
    "prompt": "<user feature request, made self-contained>",
    "response": "<concrete implementation: code/diff/spec/etc>",
    "kind": "feature-build",
    "source": "<session-id>",
    "feature_class": "<heuristic-tag>"
  }

Heuristic feature-class tags pulled from the user's actual asks:
  - "agent-coding-loop"   → autonomous coding agent / dev chain
  - "monitoring"          → 24/7 monitor, watchdog, observability
  - "training-technique"  → new technique to add to trainer (RL, DPO, etc.)
  - "orchestration"       → multi-agent, spawn/aggregate, parallel
  - "knowledge-ingest"    → distill X into model
  - "self-improvement"    → closed-loop improvement, flywheel
  - "release-pipeline"    → autonomous release, CI/CD, draft PR
  - "incident-response"   → auto-heal, diagnose, fix
  - "evaluation"          → bench, scoring, verdict
  - "cost-efficiency"     → frontier-style smarter-with-less
"""
from __future__ import annotations

import argparse
import json
import re
import sys
from pathlib import Path

PROJECTS = Path.home() / ".claude/projects"

# Heuristic patterns: match Thai + English request-style language
REQUEST_PATTERNS = [
    # Direct imperative "build/make/add/create X"
    r"\b(build|make|add|create|implement|wire|train|bake|ingest|distill)\s+\w+",
    # Thai imperatives ("do", "create", "add", "train")
    r"(ทำ|สร้าง|เพิ่ม|implement|train|เทรน|fine-?tune|fine_tune)\s+",
    # Feature-shaped asks
    r"feature\s+",
    r"(can|please|let|let's|how about)\s+",
    # explicit "I want X" / "I need X" (English + Thai)
    r"(I want|I need|I'd like|ผมอยาก|ผมต้องการ|อยาก|ต้องการ)",
    # research-asks that lead to features ("go research", "find out", Thai equivalents)
    r"(go research|research|find out|หามา|ลองหา|วิจัย)",
]

FEATURE_CLASS_KEYWORDS = {
    "agent-coding-loop": ["agent chain", "dev agent", "autonomous coding", "code 24", "spawn agent", "sub-?agent", "team agent"],
    "monitoring": ["monitor", "watchdog", "observability", "smoke", "health check", "anomaly", "incident detect"],
    "training-technique": ["GRPO", "DPO", "DAPO", "ORPO", "KTO", "RLVR", "SimPO", "Constitutional AI", "TruthRL", "Mask-DPO", "PiSSA", "LoftQ", "CorDA", "DoRA", "RSLoRA", "NEFTune", "Spectrum", "Quiet-STaR", "Reflexion", "Voyager", "Magpie", "self-rewarding", "knowledge distillation", "DistillKit", "active learning", "CoT", "PRM", "verify"],
    "orchestration": ["multi-agent", "spawn", "aggregate", "parallel", "orchestrat", "team", "shared context", "subagent", "dispatch"],
    "knowledge-ingest": ["ingest", "distill", "corpus", "training data", "training pair", "Q&A", "vault", "obsidian", "memory"],
    "self-improvement": ["self-improve", "flywheel", "improve over time", "เก่งขึ้น", "online RL", "continual learn"],
    "release-pipeline": ["release", "draft PR", "CI/CD", "deploy", "MVP", "v1 v2", "v10000", "ship feature", "auto-release"],
    "incident-response": ["incident", "auto-heal", "rollback", "diagnose", "patch", "remediate", "fix"],
    "evaluation": ["bench", "eval", "score", "verdict", "rubric", "HumanEval", "MBPP", "BFCL", "SWE-Bench", "test", "วัดผล"],
    "cost-efficiency": ["smarter with less", "frontier", "efficiency", "speculative", "MoE", "sliding window", "test-time compute", "ใช้ resource น้อย"],
    "role-persona": ["SRE", "DevSecOps", "Full Stack", "PM", "PO", "SA", "BD", "QE", "AI Engineer", "Marketing", "role", "persona", "engineer"],
    "long-context": ["long context", "32K", "64K", "128K", "1M", "YaRN", "RoPE", "context window"],
    "anti-hallucination": ["halluc", "หลอน", "factual", "truth", "correctness", "F-DPO", "TruthRL", "calibration"],
}


def classify(text: str) -> str:
    text_lower = text.lower()
    scores = {}
    for cls, kws in FEATURE_CLASS_KEYWORDS.items():
        for kw in kws:
            if kw.lower() in text_lower:
                scores[cls] = scores.get(cls, 0) + 1
    if not scores:
        return "general"
    return max(scores, key=scores.get)
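The classifier is a simple keyword vote: each class scores one point per matched keyword, and `max` breaks ties by dict insertion order. A minimal standalone sketch with a hypothetical two-class table:

```python
KEYWORDS = {
    # Hypothetical two-class table mirroring the voting logic above.
    "monitoring": ["monitor", "watchdog", "health check"],
    "evaluation": ["bench", "eval", "score"],
}

def classify(text: str) -> str:
    text_lower = text.lower()
    scores = {}
    for cls, kws in KEYWORDS.items():
        for kw in kws:
            if kw.lower() in text_lower:
                scores[cls] = scores.get(cls, 0) + 1
    # No keyword hit at all falls back to a catch-all class.
    return max(scores, key=scores.get) if scores else "general"

print(classify("add a watchdog that monitors the health check endpoint"))
print(classify("write a haiku"))
```

Note the substring match is deliberate: "monitor" also fires on "monitors" and "monitoring", at the cost of occasional false positives.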


def looks_like_feature_request(user_msg: str) -> bool:
    """True if the user message is asking for something to be built."""
    if len(user_msg) < 30:
        return False
    if len(user_msg) > 4000:
        # Very long messages are usually critique/anger, not feature requests,
        # but they often contain feature requests inside, so keep scanning.
        pass
    for pat in REQUEST_PATTERNS:
        if re.search(pat, user_msg, re.I):
            return True
    return False


def extract_pairs_from_session(jsonl_path: Path) -> list[dict]:
    """For each (user, assistant) pair where the user asked for a feature,
    return a training-ready dict."""
    out = []
    last_user = None
    for line in jsonl_path.read_text(errors="replace").splitlines():
        try:
            ev = json.loads(line)
        except Exception:
            continue
        msg = ev.get("message", {})
        role = msg.get("role")
        content = msg.get("content", "")
        if isinstance(content, list):
            content = "\n".join(b.get("text", "") for b in content
                                if isinstance(b, dict) and b.get("type") == "text")
        if not isinstance(content, str):
            content = str(content)
        if role == "user":
            last_user = content
        elif role == "assistant" and last_user:
            if looks_like_feature_request(last_user):
                # Make the prompt self-contained (restate context briefly).
                # Skip if the assistant response is too short or just a clarification.
                if len(content) >= 100 and not content.startswith("?"):
                    out.append({
                        "prompt": last_user[:3000],
                        "response": content[:6000],
                        "kind": "feature-build",
                        "source": jsonl_path.stem,
                        "feature_class": classify(last_user + " " + content),
                    })
            last_user = None
    return out
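Claude session events store `message.content` either as a plain string or as a list of typed blocks; the flattening step above keeps only the text blocks. It can be checked in isolation (the sample event below is hypothetical):

```python
import json

def flatten_content(content):
    # Join only text-type blocks; tool_use/image blocks are dropped.
    if isinstance(content, list):
        content = "\n".join(b.get("text", "") for b in content
                            if isinstance(b, dict) and b.get("type") == "text")
    if not isinstance(content, str):
        content = str(content)
    return content

event = json.loads('{"message": {"role": "assistant", "content": '
                   '[{"type": "text", "text": "part 1"}, '
                   '{"type": "tool_use", "name": "bash"}, '
                   '{"type": "text", "text": "part 2"}]}}')
print(flatten_content(event["message"]["content"]))
```

The `tool_use` block is silently dropped and the two text blocks are joined with a newline.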


def main() -> int:
    p = argparse.ArgumentParser()
    p.add_argument("--out", default=str(Path.home() / ".surrogate/state/v10-ingest/conversations/feature-builds.jsonl"))
    p.add_argument("--limit", type=int, default=0)
    args = p.parse_args()

    out_path = Path(args.out)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    f_out = out_path.open("w")
    n_total = 0
    n_files = 0
    by_class = {}
    files = sorted(PROJECTS.rglob("*.jsonl"))
    if args.limit:
        files = files[:args.limit]
    print(f"scanning {len(files)} session files for feature-build pairs...")
    for fp in files:
        try:
            pairs = extract_pairs_from_session(fp)
        except Exception as e:
            sys.stderr.write(f"  skip {fp.name}: {e}\n")
            continue
        if not pairs:
            continue
        n_files += 1
        for j in pairs:
            f_out.write(json.dumps(j, ensure_ascii=False) + "\n")
            by_class[j["feature_class"]] = by_class.get(j["feature_class"], 0) + 1
            n_total += 1
        if n_files % 50 == 0:
            print(f"  scanned {n_files} sessions, extracted {n_total} feature-build pairs")
    f_out.close()
    print(f"\nDONE: {n_total} feature-build pairs from {n_files} sessions → {out_path}")
    print("By feature class:")
    for cls, n in sorted(by_class.items(), key=lambda x: -x[1]):
        print(f"  {n:5d}  {cls}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
@@ -0,0 +1,180 @@
#!/usr/bin/env bash
# Surrogate-1 V10: orchestrate ALL ingestion sources → distill → push HF.
#
# Sources (audit found 715+ artifacts, 0% had been ingested):
#   - Obsidian Vault       591 .md files
#   - .claude/memory        27 files
#   - SKILL.md              68 files
#   - Claude agents         25 files (6 user + 19 plugin)
#   - arkship decisions     31 files
#   - past conversations   748 jsonl files = 647 MB of dialogue
#
# Strategy: each source kind → its own JSONL → its own HF dataset
# (so the V10 trainer can weight them differently via merge_external).
#
# Free-tier API rotation handles rate limits across Cerebras+Groq+OpenRouter.
# All output is checkpointed (resume-friendly): re-running won't redo done work.
set -uo pipefail
[[ -f "$HOME/.hermes/.env" ]] && { set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a; }

V3="$HOME/.surrogate/hf-space/bin/v3"
ING="$HOME/.surrogate/state/v10-ingest"
LOG="$HOME/.surrogate/logs/ingest-all.log"
mkdir -p "$ING" "$(dirname "$LOG")"

log() { echo "[$(date '+%Y-%m-%dT%H:%M:%S')] $*" | tee -a "$LOG"; }

notify() {
    [[ -z "${DISCORD_WEBHOOK:-}" ]] && return
    curl -s -X POST -H "Content-Type: application/json" \
        -d "$(python3 -c "import json,sys; print(json.dumps({'content': sys.argv[1]}))" "$1")" \
        "$DISCORD_WEBHOOK" >/dev/null 2>&1 || true
}

ingest_dir() {
    local src_dir="$1" kind="$2" out_jsonl="$3" repo="$4" limit="${5:-0}"
    log ""
    log "── INGEST: $kind from $src_dir ──"
    [[ ! -d "$src_dir" ]] && { log "  skip: $src_dir not a dir"; return 0; }
    local extra=""
    [[ "$limit" != "0" ]] && extra="--limit $limit"
    python3 "$V3/distiller.py" --input "$src_dir" --kind "$kind" \
        --out "$out_jsonl" $extra 2>&1 | tee -a "$LOG" | tail -10
    if [[ -s "$out_jsonl" ]]; then
        local n; n=$(wc -l < "$out_jsonl" | tr -d ' ')
        log "  → $n pairs in $out_jsonl"
        if (( n > 50 )); then
            log "  pushing → $repo"
            python3 "$V3/push-hf-dataset.py" --input "$out_jsonl" --repo "$repo" 2>&1 | tail -5 | tee -a "$LOG"
        fi
    fi
}

ingest_file() {
    local src="$1" kind="$2" out_jsonl="$3" repo="$4"
    [[ ! -f "$src" ]] && return 0
    log ""
    log "── INGEST FILE: $kind from $src ──"
    python3 "$V3/distiller.py" --input "$src" --kind "$kind" \
        --out "$out_jsonl" --source-tag "$src" 2>&1 | tail -3
    [[ -s "$out_jsonl" ]] && python3 "$V3/push-hf-dataset.py" \
        --input "$out_jsonl" --repo "$repo" 2>&1 | tail -3
}

log "─── V10 INGEST PIPELINE STARTING ───"
notify "🚀 V10 ingest pipeline starting: 715+ artifacts → 7 HF datasets"

# ── 1. Conversations: extract first, then distill ──────────────────────────
log ""
log "── STEP 1: extract past conversations ──"
python3 "$V3/conversation-extractor.py" --limit 0 --min-pairs 3 2>&1 | tail -5 | tee -a "$LOG"
log "── distill conversations ──"
ingest_dir "$ING/conversations/text" "conversation" \
    "$ING/conversations/pairs.jsonl" \
    "axentx/surrogate-1-conversations" 0

# ── 2. Obsidian Vault knowledge ─────────────────────────────────────────────
ingest_dir "$HOME/Documents/Obsidian Vault/AI-Hub/knowledge" "knowledge" \
    "$ING/vault/knowledge.jsonl" \
    "axentx/surrogate-1-knowledge-vault" 0

ingest_dir "$HOME/Documents/Obsidian Vault/AI-Hub/patterns" "knowledge" \
    "$ING/vault/patterns.jsonl" \
    "axentx/surrogate-1-knowledge-patterns" 0

# ── 3. .claude/memory ───────────────────────────────────────────────────────
ingest_dir "$HOME/.claude/memory" "memory" \
    "$ING/memory/lessons.jsonl" \
    "axentx/surrogate-1-knowledge-memory" 0

# ── 4. SKILL.md mirror (Anthropic + community) ──────────────────────────────
log ""
log "── INGEST: skills (Anthropic + community SKILL.md) ──"
mkdir -p "$ING/skills"
find "$HOME/Documents/Obsidian Vault/AI-Hub/skills" \
     "$HOME/.claude/skills" \
     "$HOME/.claude/plugins/cache" \
     -name "SKILL.md" -type f 2>/dev/null \
  | while read -r f; do
      python3 "$V3/distiller.py" --input "$f" --kind "skill" \
          --out "$ING/skills/skills.jsonl" --source-tag "$f" 2>&1 | tail -2
    done | tee -a "$LOG" | tail -20
[[ -s "$ING/skills/skills.jsonl" ]] && python3 "$V3/push-hf-dataset.py" \
    --input "$ING/skills/skills.jsonl" \
    --repo "axentx/surrogate-1-skills-mirror" 2>&1 | tail -3

# ── 5. Claude agents (user + plugin) ────────────────────────────────────────
log ""
log "── INGEST: agents (user + plugin) ──"
mkdir -p "$ING/agents"
{
  find "$HOME/.claude/agents" -name "*.md" -type f 2>/dev/null
  find "$HOME/.claude/plugins" -path "*agents*" -name "*.md" -type f 2>/dev/null
} | sort -u | while read -r f; do
    python3 "$V3/distiller.py" --input "$f" --kind "agent" \
        --out "$ING/agents/personas.jsonl" --source-tag "$f" 2>&1 | tail -2
  done | tee -a "$LOG" | tail -10
[[ -s "$ING/agents/personas.jsonl" ]] && python3 "$V3/push-hf-dataset.py" \
    --input "$ING/agents/personas.jsonl" \
    --repo "axentx/surrogate-1-roles-claude-builtin" 2>&1 | tail -3

# ── 6. arkship decisions (technique research) ───────────────────────────────
ingest_dir "$HOME/axentx/arkship/decisions" "decision" \
    "$ING/decisions/techniques.jsonl" \
    "axentx/surrogate-1-arkship-decisions" 0

# ── 7. all axentx project decisions (Costinel/Vanguard/etc) ─────────────────
log ""
log "── INGEST: axentx project decisions (cross-project) ──"
mkdir -p "$ING/decisions"
find "$HOME/axentx" -maxdepth 4 -name "decisions" -type d 2>/dev/null \
  | while read -r d; do
      log "  scanning $d"
      python3 "$V3/distiller.py" --input "$d" --kind "decision" \
          --out "$ING/decisions/all-axentx.jsonl" --source-tag "$d" 2>&1 | tail -3
    done | tee -a "$LOG" | tail -20
[[ -s "$ING/decisions/all-axentx.jsonl" ]] && python3 "$V3/push-hf-dataset.py" \
    --input "$ING/decisions/all-axentx.jsonl" \
    --repo "axentx/surrogate-1-axentx-decisions" 2>&1 | tail -3

# ── 8. surrogate trainer history (160 commits): meta-learning corpus ────────
log ""
log "── INGEST: hf-space self-development history ──"
mkdir -p "$ING/self-trace"
cd "$HOME/.surrogate/hf-space" 2>/dev/null && {
    git log --pretty=format:"%H%n%an%n%aI%n%s%n%n%b%n---END---" \
        --since="60 days ago" --no-merges \
        > "$ING/self-trace/hf-space-history.txt" 2>/dev/null
    log "  hf-space history dumped: $(wc -l < "$ING/self-trace/hf-space-history.txt" | tr -d ' ') lines"
}
[[ -s "$ING/self-trace/hf-space-history.txt" ]] && python3 "$V3/distiller.py" \
    --input "$ING/self-trace/hf-space-history.txt" \
    --kind "decision" \
    --out "$ING/self-trace/dev-trace.jsonl" \
    --source-tag "hf-space-git-history" 2>&1 | tail -5

# ── final summary + push merged ──────────────────────────────────────────────
log ""
log "─── V10 INGEST SUMMARY ───"
total=0
for f in "$ING"/*/*.jsonl; do
    [[ -s "$f" ]] || continue
    n=$(wc -l < "$f" | tr -d ' ')
    total=$((total + n))
    log "  $(basename "$(dirname "$f")")/$(basename "$f"): $n pairs"
done
log "  ────────────────────────"
log "  TOTAL: $total pairs across all sinks"

# build merged dataset for trainer convenience
mkdir -p "$ING/merged"
cat "$ING"/*/*.jsonl > "$ING/merged/all.jsonl" 2>/dev/null
if [[ -s "$ING/merged/all.jsonl" ]]; then
    log "  merged total: $(wc -l < "$ING/merged/all.jsonl" | tr -d ' ') pairs"
    python3 "$V3/push-hf-dataset.py" \
        --input "$ING/merged/all.jsonl" \
        --repo "axentx/surrogate-1-v10-merged" 2>&1 | tail -3
fi

notify "✅ V10 ingest done: $total pairs across 7 HF datasets, ready for V10 trainer"
log "─── DONE ───"
@@ -0,0 +1,373 @@
#!/usr/bin/env python3
"""Surrogate-1 V10: Kaggle ingest kernel.

Runs ENTIRELY on Kaggle (T4×2 free). NO Mac compute.

Pipeline:
 1. Pull source bundle from axentx/surrogate-1-v10-source-bundle (tar.gz)
 2. Extract per-kind source files (vault, memory, skills, agents, decisions,
    conversations, feature-builds)
 3. Distill each via the Cerebras → Groq → OpenRouter free chain
 4. Push results to 7+ axentx/surrogate-1-knowledge-* HF datasets

Setup in Kaggle UI:
  Add-ons → Secrets → attach: HF_TOKEN, CEREBRAS_API_KEY, GROQ_API_KEY,
  OPENROUTER_API_KEY (all already attached from the V8 trainer)

Outputs (these axentx/* datasets will be CREATED if missing, which fixes
the 404s the audit found):
  axentx/surrogate-1-knowledge-vault      → Obsidian Vault distilled Q&A
  axentx/surrogate-1-knowledge-memory     → .claude/memory lessons
  axentx/surrogate-1-knowledge-patterns   → Vault/patterns/
  axentx/surrogate-1-skills-mirror        → 68 SKILL.md demonstrations
  axentx/surrogate-1-roles-claude-builtin → 25 agent role personas
  axentx/surrogate-1-arkship-decisions    → 31 technique research notes
  axentx/surrogate-1-axentx-decisions     → cross-project decisions
  axentx/surrogate-1-conversations        → past conversation Q&A
  axentx/surrogate-1-feature-builds       → extracted feature requests
                                            + their expert build responses

Pickup by the V10 trainer: kaggle-trainer.sh's merge_external() calls
already support these dataset names, so no trainer changes are needed.

Usage:
  Save Version on Kaggle. Run. Output appears in the HF datasets.
"""
from __future__ import annotations

import os
import json
import re
import sys
import time
import tarfile
import hashlib
import subprocess
from pathlib import Path
from urllib import request, error

# ── Bootstrap Kaggle Secrets → env (Kaggle Secrets aren't auto-injected) ────
try:
    from kaggle_secrets import UserSecretsClient
    _us = UserSecretsClient()
    for k in ("HF_TOKEN", "CEREBRAS_API_KEY", "GROQ_API_KEY",
              "OPENROUTER_API_KEY", "OPENROUTER_API_KEY_2", "ANTHROPIC_API_KEY"):
        if k not in os.environ:
            try:
                os.environ[k] = _us.get_secret(k)
            except Exception:
                pass
    print("Kaggle secrets bootstrapped:",
          [k for k in ("HF_TOKEN", "CEREBRAS_API_KEY", "GROQ_API_KEY",
                       "OPENROUTER_API_KEY") if os.environ.get(k)])
except ImportError:
    print("not on Kaggle: env must come from the shell")

# Install deps
print("Installing deps...")
subprocess.run([sys.executable, "-m", "pip", "install", "-q",
                "huggingface_hub", "datasets"], check=False)

from huggingface_hub import HfApi, create_repo, snapshot_download

WORK = Path("/kaggle/working/v10-ingest")
WORK.mkdir(parents=True, exist_ok=True)
SRC_BUNDLE_REPO = "axentx/surrogate-1-v10-source-bundle"
SRC_BUNDLE_FILE = "bundle.tar.gz"

# ── Free-tier API rotation (User-Agent fix to avoid 403) ────────────────────
APIS = [
    ("groq-llama", "https://api.groq.com/openai/v1/chat/completions",
     "GROQ_API_KEY", "llama-3.3-70b-versatile"),
    ("groq-qwen", "https://api.groq.com/openai/v1/chat/completions",
     "GROQ_API_KEY", "qwen/qwen3-32b"),
    ("groq-gpt", "https://api.groq.com/openai/v1/chat/completions",
     "GROQ_API_KEY", "openai/gpt-oss-120b"),
    ("cerebras-qwen", "https://api.cerebras.ai/v1/chat/completions",
     "CEREBRAS_API_KEY", "qwen-3-235b-a22b-instruct-2507"),
    ("cerebras-gpt", "https://api.cerebras.ai/v1/chat/completions",
     "CEREBRAS_API_KEY", "gpt-oss-120b"),
    ("openrouter1", "https://openrouter.ai/api/v1/chat/completions",
     "OPENROUTER_API_KEY", "meta-llama/llama-3.3-70b-instruct:free"),
    ("openrouter2", "https://openrouter.ai/api/v1/chat/completions",
     "OPENROUTER_API_KEY_2", "deepseek/deepseek-chat-v3.1:free"),
]
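The table drives a first-success failover: try each (name, url, key_env, model) entry in order and stop at the first backend that returns text. A standalone sketch of that rotation, with a stubbed caller in place of the real HTTP request (the `fake_call` failure is hypothetical):

```python
APIS = [
    # (name, key_env, model): condensed form of the rotation table above.
    ("groq-llama", "GROQ_API_KEY", "llama-3.3-70b-versatile"),
    ("cerebras-qwen", "CEREBRAS_API_KEY", "qwen-3-235b-a22b-instruct-2507"),
    ("openrouter1", "OPENROUTER_API_KEY", "meta-llama/llama-3.3-70b-instruct:free"),
]

def fake_call(name, model, prompt):
    # Stand-in for the real HTTP call: pretend the first backend is rate-limited.
    if name == "groq-llama":
        return None, "429 rate limited"
    return f"[{model}] ok", None

def call_with_rotation(prompt):
    errors = []
    for name, key_env, model in APIS:
        text, err = fake_call(name, model, prompt)
        if text:
            return text  # first success wins; later backends never fire
        errors.append(f"{name}: {err}")
    raise RuntimeError("all APIs failed: " + "; ".join(errors))

print(call_with_rotation("distill this"))
```

Because groq-llama fails, the call falls through to the second entry and returns its result; only if every entry fails does the accumulated error list surface.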

PROMPTS = {
    "knowledge": (
        "Distill the document into 4-8 instruction/response training pairs for a "
        "senior code+SRE+DevSecOps LLM (Surrogate-1).\n"
        "Source ({source}):\n```\n{text}\n```\n"
        "Each pair: realistic engineer question + expert response with real APIs/CLIs/standards. "
        "100-500 words/response. NO `<thinking>` blocks. NO hedging. "
        "Output ONLY valid JSONL, one per line: "
        '{{"prompt":"...","response":"..."}}'
    ),
    "memory": (
        "Convert this lessons-learned/preferences/past-mistake memory file into 3-6 training pairs "
        "that BAKE the lesson into the model.\n"
        "Source ({source}):\n```\n{text}\n```\n"
        "Each pair: prompt = the kind of question that would trigger the lesson. "
        "response = the lesson applied correctly. "
        "Output ONLY valid JSONL: "
        '{{"prompt":"...","response":"..."}}'
    ),
    "skill": (
        "Convert this SKILL.md into 5-8 skill-demonstration training pairs.\n"
        "Source ({source}):\n```\n{text}\n```\n"
        "Each pair: prompt = realistic user request invoking the skill. "
        "response = complete expert demonstration in the skill's specified format. "
        "Output ONLY valid JSONL: "
        '{{"prompt":"...","response":"..."}}'
    ),
    "agent": (
        "Convert this Claude agent definition into 8-12 role-persona training pairs.\n"
        "Source ({source}):\n```\n{text}\n```\n"
        "Each pair: prompt = realistic task delegated to this agent. "
        "response = agent's expert response IN-ROLE using its specified output format. "
        "Output ONLY valid JSONL: "
        '{{"prompt":"...","response":"..."}}'
    ),
    "decision": (
        "Extract technique knowledge from this AI research decision log into 3-5 Q&A pairs.\n"
        "Source ({source}):\n```\n{text}\n```\n"
        "Capture: what + when to use + how to wire + expected gain + cost + alternatives. "
        "Output ONLY valid JSONL: "
        '{{"prompt":"...","response":"..."}}'
    ),
    "conversation": (
        "From this engineer-assistant conversation transcript, extract 3-8 (instruction, "
        "expert-response) pairs from the GOOD moments (skip filler, skip wrong answers "
        "the user later corrected). Make pairs self-contained.\n"
        "Source ({source}):\n```\n{text}\n```\n"
        "Output ONLY valid JSONL: "
        '{{"prompt":"...","response":"..."}}'
    ),
    "feature-build": (
        "The user asked Claude to BUILD a feature in the Surrogate-1 model. Convert this "
        "request + expert-implementation pair into a FEATURE-BUILD training pair so the "
        "model itself learns to handle similar requests.\n"
        "User request: ```{text}```\n"
        "Source ({source})\n"
        "Output ONE valid JSON object: "
        '{{"prompt":"<self-contained restatement of the build request>",'
        '"response":"<concrete expert build: code/diff/spec/architecture>"}}'
    ),
}
|
| 155 |
+
|
| 156 |
+
|
| 157 |
+
def call_api(name, url, key_env, model, prompt, max_tokens=4000, timeout=45):
    key = os.environ.get(key_env, "")
    if not key:
        return None, f"no {key_env}"
    try:
        req = request.Request(
            url,
            data=json.dumps({
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,
                "temperature": 0.4,
            }).encode(),
            headers={"Authorization": f"Bearer {key}",
                     "Content-Type": "application/json",
                     "User-Agent": "surrogate-1-distiller/1.0",
                     "Accept": "application/json"},
        )
        with request.urlopen(req, timeout=timeout) as r:
            d = json.loads(r.read().decode())
            return d["choices"][0]["message"]["content"], None
    except error.HTTPError as e:
        try:
            msg = e.read().decode()[:200]
        except Exception:
            msg = ""
        return None, f"{name} {e.code}: {msg}"
    except Exception as e:
        return None, f"{name} {type(e).__name__}: {str(e)[:120]}"


def call_frontier(prompt: str) -> str | None:
    errs = []
    for name, url, key_env, model in APIS:
        out, err = call_api(name, url, key_env, model, prompt)
        if out:
            return out
        if err:
            errs.append(err[:90])
    print("  frontier all-fail:", " | ".join(errs[:3]))
    return None


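The Cerebras→Groq→OpenRouter fallback above can be exercised offline with stubbed providers. This standalone sketch (the provider tuples and canned responses here are illustrative, not part of the kernel) shows the intended behavior: the first provider that returns content wins, and errors only accumulate until then.

```python
# Standalone sketch of the provider-fallback loop in call_frontier().
# Provider names, URLs, and canned replies below are illustrative only.
APIS = [
    ("prov-a", "https://a.example/v1", "A_KEY", "model-a"),   # will "fail"
    ("prov-b", "https://b.example/v1", "B_KEY", "model-b"),   # will "succeed"
    ("prov-c", "https://c.example/v1", "C_KEY", "model-c"),   # never reached
]

CANNED = {"prov-a": (None, "prov-a 429: rate limited"),
          "prov-b": ('{"prompt":"q","response":"a"}', None),
          "prov-c": (None, "prov-c 403: forbidden")}

def call_api(name, url, key_env, model, prompt):
    return CANNED[name]                     # stand-in for the real HTTP call

def call_frontier(prompt):
    errs = []
    for name, url, key_env, model in APIS:
        out, err = call_api(name, url, key_env, model, prompt)
        if out:
            return out
        if err:
            errs.append(err[:90])
    print("  frontier all-fail:", " | ".join(errs[:3]))
    return None

print(call_frontier("distill this"))        # prov-b's canned reply
```

Note that a provider with a missing key contributes only a short error string, so one misconfigured secret degrades the chain rather than aborting the kernel.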
def parse_jsonl(text: str) -> list[dict]:
    pairs = []
    for L in text.splitlines():
        L = L.strip()
        if not L or L.startswith("```") or L.startswith("//"):
            continue
        try:
            j = json.loads(L)
            if isinstance(j, dict) and "prompt" in j and "response" in j:
                pairs.append(j)
        except Exception:
            continue
    if not pairs:
        for m in re.finditer(r'\{\s*"prompt"\s*:.*?\}\s*(?=\{|\Z)', text, re.S):
            try:
                j = json.loads(m.group(0))
                if "prompt" in j and "response" in j:
                    pairs.append(j)
            except Exception:
                continue
    return pairs


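A standalone sanity check of the parser above (the function body is repeated here so the snippet runs on its own): fenced and comment lines are skipped, and the regex fallback recovers objects when the model returns concatenated JSON instead of line-delimited JSONL.

```python
import json
import re

def parse_jsonl(text: str) -> list[dict]:
    # Identical to the kernel's parser, copied so this demo is standalone.
    pairs = []
    for L in text.splitlines():
        L = L.strip()
        if not L or L.startswith("```") or L.startswith("//"):
            continue
        try:
            j = json.loads(L)
            if isinstance(j, dict) and "prompt" in j and "response" in j:
                pairs.append(j)
        except Exception:
            continue
    if not pairs:
        # Fallback: objects glued together without newlines.
        for m in re.finditer(r'\{\s*"prompt"\s*:.*?\}\s*(?=\{|\Z)', text, re.S):
            try:
                j = json.loads(m.group(0))
                if "prompt" in j and "response" in j:
                    pairs.append(j)
            except Exception:
                continue
    return pairs

clean = '```json\n{"prompt":"p1","response":"r1"}\n// note\n{"prompt":"p2","response":"r2"}\n```'
glued = '{"prompt":"p1","response":"r1"} {"prompt":"p2","response":"r2"}'
print(len(parse_jsonl(clean)), len(parse_jsonl(glued)))  # 2 2
```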
def chunk_text(text: str, max_chars=8000) -> list[str]:
    if len(text) <= max_chars:
        return [text]
    chunks, i = [], 0
    while i < len(text):
        chunk = text[i:i + max_chars]
        if i + max_chars < len(text):
            # prefer to cut at a paragraph boundary in the back half of the window
            cut = chunk.rfind("\n\n")
            if cut > max_chars // 2:
                chunk = chunk[:cut]
        chunks.append(chunk)
        i += len(chunk)
    return chunks


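The chunker's two invariants can be checked standalone (a copy of the function above, run with a small `max_chars` so both code paths are exercised): the chunks concatenate losslessly back to the original, and every chunk stays within the budget.

```python
def chunk_text(text: str, max_chars=8000) -> list[str]:
    # Identical to the kernel's chunker, copied so this check is standalone.
    if len(text) <= max_chars:
        return [text]
    chunks, i = [], 0
    while i < len(text):
        chunk = text[i:i + max_chars]
        if i + max_chars < len(text):
            cut = chunk.rfind("\n\n")       # prefer a paragraph boundary
            if cut > max_chars // 2:
                chunk = chunk[:cut]
        chunks.append(chunk)
        i += len(chunk)
    return chunks

doc = ("para one " * 10 + "\n\n") * 6       # ~550 chars with paragraph breaks
parts = chunk_text(doc, max_chars=200)
assert "".join(parts) == doc                # lossless
assert all(len(p) <= 200 for p in parts)    # bounded
```

The `cut > max_chars // 2` guard matters: it refuses a paragraph cut that would produce a tiny chunk, which also guarantees forward progress in the loop.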
def distill_dir(src_dir: Path, kind: str, out_path: Path):
    print(f"\n── distill {kind} from {src_dir} ──")
    if not src_dir.exists():
        print(f"  skip: {src_dir} missing")
        return 0
    files = sorted(src_dir.rglob("*.md"))
    if not files:
        files = sorted(src_dir.rglob("*"))
    files = [f for f in files if f.is_file()]
    print(f"  found {len(files)} files")

    # resume support: fingerprint pairs already on disk so reruns don't duplicate
    seen = set()
    if out_path.exists():
        with out_path.open() as f:
            for L in f:
                try:
                    j = json.loads(L)
                    h = hashlib.sha256((j["prompt"][:200] + j["response"][:200]).encode()).hexdigest()[:16]
                    seen.add(h)
                except Exception:
                    continue

    f_out = out_path.open("a")
    n_pairs = 0
    for i, fp in enumerate(files, 1):
        try:
            text = fp.read_text(errors="replace")
        except Exception:
            continue
        if len(text) < 100:
            continue
        for chunk in chunk_text(text):
            prompt = PROMPTS[kind].format(text=chunk, source=str(fp))
            raw = call_frontier(prompt)
            if not raw:
                continue
            for j in parse_jsonl(raw):
                h = hashlib.sha256((j["prompt"][:200] + j["response"][:200]).encode()).hexdigest()[:16]
                if h in seen:
                    continue
                seen.add(h)
                j["source"] = str(fp)
                j["kind"] = kind
                f_out.write(json.dumps(j, ensure_ascii=False) + "\n")
                f_out.flush()
                n_pairs += 1
        if i % 10 == 0:
            print(f"  [{i}/{len(files)}] total pairs: {n_pairs}")
    f_out.close()
    print(f"  ✓ {kind}: {n_pairs} pairs → {out_path}")
    return n_pairs


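`distill_dir` deduplicates across resumed runs with a 16-hex-char fingerprint over the first 200 characters of prompt and response. The scheme in isolation (the helper name `pair_fingerprint` is introduced here for illustration; the kernel inlines the expression):

```python
import hashlib
import json

def pair_fingerprint(j: dict) -> str:
    # Same scheme as distill_dir: sha256 over truncated prompt+response, 16 hex chars.
    return hashlib.sha256((j["prompt"][:200] + j["response"][:200]).encode()).hexdigest()[:16]

seen = set()
kept = []
for line in ['{"prompt":"p","response":"r"}',
             '{"prompt":"p","response":"r"}',      # exact duplicate -> dropped
             '{"prompt":"p","response":"r2"}']:
    j = json.loads(line)
    h = pair_fingerprint(j)
    if h in seen:
        continue
    seen.add(h)
    kept.append(j)
print(len(kept))  # 2
```

Truncating to 200 characters means two pairs that diverge only deep in a long response collide; that is a deliberate trade-off in favor of cheap resume-dedup, not exact equality.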
def push_dataset(jsonl_path: Path, repo: str):
    if not jsonl_path.exists() or jsonl_path.stat().st_size == 0:
        print(f"  skip empty {jsonl_path}")
        return
    api = HfApi(token=os.environ["HF_TOKEN"])
    try:
        create_repo(repo, repo_type="dataset", exist_ok=True, private=False)
    except Exception as e:
        print(f"  create_repo: {e}")
    api.upload_file(path_or_fileobj=str(jsonl_path),
                    path_in_repo="train.jsonl",
                    repo_id=repo, repo_type="dataset",
                    commit_message="surrogate-1 V10 ingest from Kaggle")
    n = sum(1 for _ in jsonl_path.open())
    print(f"  ✓ pushed {n} pairs → https://huggingface.co/datasets/{repo}")


# ── MAIN ────────────────────────────────────────────────────────────────────
def main():
    print("\n═══ V10 INGEST KERNEL – Kaggle T4×2 ═══\n")

    # 1. Pull source bundle
    print("Step 1: pull source bundle from HF...")
    bundle_dir = WORK / "bundle"
    try:
        snapshot_download(repo_id=SRC_BUNDLE_REPO, repo_type="dataset",
                          local_dir=str(bundle_dir),
                          token=os.environ.get("HF_TOKEN"))
    except Exception as e:
        print(f"  ✗ pull bundle failed: {e}")
        print(f"  → run bundle-and-upload.sh on Mac first to push {SRC_BUNDLE_REPO}")
        return 1

    bundle_tgz = bundle_dir / SRC_BUNDLE_FILE
    if not bundle_tgz.exists():
        print(f"  ✗ {bundle_tgz} missing in pulled snapshot")
        return 1

    # 2. Extract
    extract_dir = WORK / "src"
    extract_dir.mkdir(exist_ok=True)
    print(f"\nStep 2: extract {bundle_tgz} → {extract_dir}")
    with tarfile.open(str(bundle_tgz), "r:gz") as tf:
        tf.extractall(str(extract_dir))
    for sub in extract_dir.iterdir():
        if sub.is_dir():
            n = sum(1 for p in sub.rglob("*") if p.is_file())
            print(f"  {sub.name}: {n} files")

    # 3. Distill each source kind → its own JSONL → push to HF
    pipeline = [
        ("vault",             "knowledge",     "axentx/surrogate-1-knowledge-vault"),
        ("patterns",          "knowledge",     "axentx/surrogate-1-knowledge-patterns"),
        ("memory",            "memory",        "axentx/surrogate-1-knowledge-memory"),
        ("skills",            "skill",         "axentx/surrogate-1-skills-mirror"),
        ("agents",            "agent",         "axentx/surrogate-1-roles-claude-builtin"),
        ("arkship-decisions", "decision",      "axentx/surrogate-1-arkship-decisions"),
        ("axentx-decisions",  "decision",      "axentx/surrogate-1-axentx-decisions"),
        ("conversations",     "conversation",  "axentx/surrogate-1-conversations"),
        ("feature-builds",    "feature-build", "axentx/surrogate-1-feature-builds"),
    ]

    out_dir = WORK / "out"
    out_dir.mkdir(exist_ok=True)
    summary = {}
    for sub_name, kind, repo in pipeline:
        src = extract_dir / sub_name
        if not src.exists():
            print(f"\n── skip {sub_name} (not in bundle) ──")
            continue
        out_jsonl = out_dir / f"{sub_name}.jsonl"
        n = distill_dir(src, kind, out_jsonl)
        summary[sub_name] = n
        if n > 0:
            push_dataset(out_jsonl, repo)

    # 4. Build merged dataset for trainer convenience
    merged = out_dir / "merged.jsonl"
    with merged.open("w") as out:
        for f in out_dir.glob("*.jsonl"):
            if f.name == "merged.jsonl":
                continue
            out.write(f.read_text())
    if merged.stat().st_size > 0:
        push_dataset(merged, "axentx/surrogate-1-v10-merged")

    print("\n──────────────────────────────────")
    print("  V10 INGEST COMPLETE")
    print("──────────────────────────────────")
    for k, v in summary.items():
        print(f"  {k:20s}: {v} pairs")
    print(f"  TOTAL: {sum(summary.values())} pairs across {len(summary)} datasets")
    print("\nNext: V10 trainer kernel will pull these via merge_external()")
    return 0


if __name__ == "__main__":
    sys.exit(main())

@@ -0,0 +1,44 @@
#!/usr/bin/env python3
"""Push a local JSONL → HF dataset repo. Creates the repo if missing.

Fixes the 404 datasets the audit found.
"""
from __future__ import annotations
import argparse
import os
import sys
from pathlib import Path


def main() -> int:
    p = argparse.ArgumentParser()
    p.add_argument("--input", required=True, help="JSONL file to upload")
    p.add_argument("--repo", required=True, help="HF dataset repo, e.g. axentx/foo")
    p.add_argument("--filename", default="train.jsonl")
    p.add_argument("--private", action="store_true")
    args = p.parse_args()

    token = os.environ.get("HF_TOKEN") or os.environ.get("HF_TOKEN_PRO_WRITE")
    if not token:
        sys.stderr.write("HF_TOKEN missing\n")
        return 2

    src = Path(args.input)
    if not src.exists() or src.stat().st_size == 0:
        sys.stderr.write(f"{src} missing or empty\n")
        return 2

    from huggingface_hub import HfApi, create_repo
    api = HfApi(token=token)
    try:
        # pass the token explicitly so the CLI works without a cached login
        create_repo(args.repo, repo_type="dataset", exist_ok=True,
                    private=args.private, token=token)
        print(f"  ✓ repo ready: {args.repo}")
    except Exception as e:
        print(f"  create_repo: {e}")
    api.upload_file(path_or_fileobj=str(src),
                    path_in_repo=args.filename,
                    repo_id=args.repo,
                    repo_type="dataset",
                    commit_message=f"surrogate-1 V10 ingest: {src.name}")
    n_lines = sum(1 for _ in src.open())
    print(f"  ✓ pushed {n_lines:,} lines → https://huggingface.co/datasets/{args.repo}")
    return 0


if __name__ == "__main__":
    sys.exit(main())