ashirato committed on
Commit 4831adb · 1 Parent(s): daa3c84

feat(bulk-mirror): 100+ massive datasets + 4-worker claim-queue coordination

User feedback 2026-04-29: 'All agents should work together and not revisit the
same places; search by keyword, then go to new ones.'

NEW:
- bin/v2/bulk-datasets-massive.txt: curated list of 100+ HF datasets relevant
to Surrogate-1 v2 (code, agent, security, SRE, cloud, business, multilingual,
reasoning, IaC). Format: repo|category|max_samples|priority. Total target
ingestion: 100M+ samples / 5+ TB raw / ~500B-1T tokens before dedup.

- bin/v2/bulk-mirror-coordinator.py: SQLite claim-queue broker. Workers atomically
claim tasks (UPDATE...RETURNING with priority sort). Stale claims auto-expire
after 15 min so dead workers don't block. Status command shows queue health.

- bin/v2/bulk-mirror-worker.sh: stateless worker loop. Claims a task → mirrors via
datasets streaming → sanitizes (lib/sanitize.py; interface sketched after this
list) → dedups against the central DedupStore (SQLite) → writes to
/data/bulk-mirror/<repo>.jsonl → marks it done. Sleeps 5 min if the queue is empty.

- start.sh: seeds queue + spawns 4 parallel workers at boot.
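
lib/sanitize.py itself is not part of this commit; the worker only depends on
filter_pair(prompt, response) returning a dict with a boolean "keep". A minimal
sketch of that assumed contract (the real sanitizer from 1dfdc54 applies far
more rules; the two checks shown here are purely illustrative):

    # Stand-in for lib/sanitize.py, showing only the interface the worker uses.
    # Both checks are hypothetical; the shipped sanitizer (1dfdc54) is authoritative.
    import re

    _SECRET_RE = re.compile(r"(api[_-]?key|secret|password)\s*[:=]", re.I)

    def filter_pair(prompt: str, response: str) -> dict:
        """Return {"keep": bool, "reason": str} for one prompt/response pair."""
        if _SECRET_RE.search(prompt) or _SECRET_RE.search(response):
            return {"keep": False, "reason": "possible credential"}
        if len(set(response.split())) < 5:
            return {"keep": False, "reason": "degenerate response"}
        return {"keep": True, "reason": "ok"}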

Workers coordinate via SQLite — each pulls a UNIQUE dataset, for 4× the
throughput of a single mirror. Dedup ensures the same content arriving from two
sources is not ingested twice into siblings.
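
lib/dedup.py is likewise outside this diff. The workers assume a
DedupStore.is_new classmethod with first-writer-wins semantics: the first
occurrence of a content hash is recorded and accepted, and any later arrival of
the same content, even from a different dataset, is rejected. A minimal sketch
under that assumption (hash choice, schema, and DB path here are illustrative,
not the shipped implementation):

    # Illustrative sketch only; the shipped DedupStore lives in bin/lib/dedup.py.
    import hashlib, sqlite3
    from pathlib import Path

    _DB = str(Path.home() / ".surrogate/state/dedup.db")  # assumed path

    class DedupStore:
        @classmethod
        def is_new(cls, text: str, source: str = "") -> bool:
            """True the first time this content hash is seen; False on repeats."""
            h = hashlib.sha256(text.strip().lower().encode()).hexdigest()
            c = sqlite3.connect(_DB, timeout=10)
            c.execute("CREATE TABLE IF NOT EXISTS seen (hash TEXT PRIMARY KEY, source TEXT)")
            before = c.total_changes
            c.execute("INSERT OR IGNORE INTO seen (hash, source) VALUES (?, ?)", (h, source))
            new = c.total_changes > before  # no new change means the hash already existed
            c.commit()
            c.close()
            return new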

Estimated steady-state with 4 workers + free LLM ladder + Ollama fallback:
- Ingestion: 100-200 GB/hr (raw) → 30-60 GB/hr unique deduped
- 100+ datasets cycling through priority 1 → 2 → 3
- ~7-14 days to mirror+dedup all 100M+ samples
- Result: 200B-500B unique tokens for v2 SFT (10× current)

bin/v2/bulk-datasets-massive.txt ADDED
@@ -0,0 +1,154 @@
+ # Surrogate-1 v2 — Massive Relevant Bulk Datasets
+ # Format: hf_repo_id|category|max_samples|priority(1=high)
+ #
+ # Everything listed here is worth mirroring + enriching + deduping.
+ # dataset-mirror.sh + dataset-enrich.sh will pick these up.
+ # Sanitizer (1dfdc54) + dedup (central SQLite) prevent duplicates across siblings.
+ # Workers claim entries via the coordinator's claim queue so no two agents pull the same dataset.
+
+ # ─── CODE / SDLC (huge mass) ───────────────────────────────────────────────────
+ bigcode/the-stack-v2|code|1000000|1
+ bigcode/the-stack-v2-smol-ids|code|500000|1
+ bigcode/the-stack-v2-train-smol-ids|code|500000|1
+ bigcode/starcoderdata|code|1000000|1
+ bigcode/the-stack|code|500000|1
+ bigcode/commits|code|500000|1
+ bigcode/the-stack-github-issues|code|200000|1
+ bigcode/issues-with-comments|code|200000|1
+ bigcode/the-stack-march-sample|code|100000|2
+ bigcode/the-stack-smol|code|100000|2
+ codeparrot/github-code-clean|code|500000|1
+ codeparrot/github-jupyter-code-to-text|code|100000|2
+ codeparrot/codeparrot-clean|code|200000|2
+ codeparrot/codecomplex|code|10000|2
+ codeparrot/github-jupyter|code|100000|2
+ codeparrot/apps|code|10000|2
+ codeparrot/conala-mined-curated|code|600000|2
+ deepmind/code_contests|code|13000|1
+ google-research-datasets/mbpp|code|1000|1
+ neulab/conala|code|600000|2
+ loubnabnl/humaneval_solutions|code|100|1
+ livecodebench/code_generation_lite|code|500|1
+ livecodebench/code_generation|code|800|1
+ princeton-nlp/SWE-bench_Lite|code|500|1
+ princeton-nlp/SWE-bench_Verified|code|500|1
+ princeton-nlp/SWE-bench|code|2000|1
+
+ # ─── REASONING / MATH-AS-CODE ─────────────────────────────────────────────────
+ nvidia/OpenCodeReasoning|reasoning|2000000|1
+ nvidia/OpenCodeReasoning-2|reasoning|2500000|1
+ nvidia/OpenCodeInstruct|code|5000000|1
+ microsoft/rStar-Coder|code|580000|1
+ HuggingFaceH4/OpenR1-Math-220k|reasoning|220000|1
+ open-thoughts/OpenThoughts-114k|reasoning|114000|1
+ GAIR/LIMR|reasoning|100000|2
+ nvidia/OpenMathInstruct-2|math|14000000|1
+ nvidia/OpenMathReasoning|math|3000000|1
+ microsoft/orca-math-word-problems-200k|math|200000|2
+ meta-math/MetaMathQA|math|395000|2
+ TIGER-Lab/MathInstruct|math|260000|2
+
+ # ─── LARGE INSTRUCTION / SFT MIXES ────────────────────────────────────────────
+ allenai/tulu-3-sft-mixture|sft|1000000|1
+ microsoft/orca-agentinstruct-1M-v1|agent|1000000|1
+ arcee-ai/EvolKit-20k|sft|20000|2
+ HuggingFaceH4/ultrachat_200k|sft|200000|2
+ teknium/OpenHermes-2.5|sft|1000000|2
+ Open-Orca/OpenOrca|sft|1000000|2
+ Open-Orca/SlimOrca-Dedup|sft|500000|2
+ inclusionAI/Ling-Coder-SFT|code|4480000|1
+ OpenCoder-LLM/opc-sft-stage1|code|2000000|1
+ OpenCoder-LLM/opc-sft-stage2|code|2500000|1
+ HuggingFaceTB/smoltalk|sft|1000000|2
+ HuggingFaceTB/smoltalk2|sft|2000000|1
+ ise-uiuc/Magicoder-OSS-Instruct-75K|code|75000|1
+ ise-uiuc/Magicoder-Evol-Instruct-110K|code|110000|1
+ HuggingFaceH4/CodeAlpaca_20K|code|20000|2
+ nickrosh/Evol-Instruct-Code-80k-v1|code|80000|1
+ bigcode/self-oss-instruct-sc2-exec-filter-50k|code|50000|1
+ m-a-p/CodeFeedback-Filtered-Instruction|code|157000|1
+ Magpie-Align/Magpie-Pro-MT-300K-v0.1|sft|300000|1
+ Magpie-Align/Magpie-Llama-3.3-Pro-1M-v0.1|sft|1000000|1
+ Magpie-Align/Magpie-Qwen2.5-Pro-1M-v0.1|sft|1000000|1
+ argilla/magpie-ultra-v1.0|sft|50000|1
+ Salesforce/xlam-function-calling-60k|tool|60000|1
+ NousResearch/hermes-function-calling-v1|tool|7900|1
+ Agent-Ark/Toucan-1.5M|tool|1500000|1
+ nvidia/When2Call|tool|30000|1
+ Nanbeige/ToolMind|tool|10000|2
+ nvidia/Nemotron-SWE-v1|tool|5000|1
+ SWE-Gym/OpenHands-Sampled-Trajectories|agent|2400|1
+ SWE-Gym/SWE-Gym|agent|491|1
+ SWE-Gym/SWE-smith|agent|26000|1
+ R2E-Gym/R2E-Gym-Lite|agent|4500|1
+
+ # ─── PREFERENCE / DPO ─────────────────────────────────────────────────────────
+ Vezora/Code-Preference-Pairs|dpo|55000|1
+ argilla/distilabel-capybara-dpo-7k-binarized|dpo|7000|1
+ allenai/tulu-3-pref-personas-instruction-following|dpo|20000|2
+ HuggingFaceH4/ultrafeedback_binarized|dpo|62000|2
+ nvidia/HelpSteer3|sft|40000|2
+ Anthropic/hh-rlhf|sft|160000|2
+
+ # ─── DEVOPS / CLOUD / IAC ─────────────────────────────────────────────────────
+ bigcode/the-stack-v2|iac|500000|1
+ zerolink/zsql-postgres-dpo|dpo|110000|2
+ b-mc2/sql-create-context|sql|78000|2
+ defog/sqlcoder|sql|50000|2
+ gretelai/synthetic_text_to_sql|sql|105000|2
+ HuggingFaceH4/no_robots|sft|10000|2
+
+ # ─── SECURITY / SOC / DEVSECOPS ───────────────────────────────────────────────
+ trendmicro-ailab/Primus-FineWeb|security|2570000|1
+ trendmicro-ailab/Primus-Instruct|security|100000|1
+ trendmicro-ailab/Primus-Reasoning|security|10000|1
+ allenai/wildjailbreak|safety|150000|1
+ ai4privacy/pii-masking-200k|safety|200000|1
+ LLaMa-Factory/data|safety|10000|2
+ allenai/coconot|safety|50000|2
+
+ # ─── AGENT / TOOL TRACES (multi-step) ─────────────────────────────────────────
+ lambda/hermes-agent-reasoning-traces|agent|14000|1
+ nebius/SWE-agent-trajectories|agent|5000|1
+ ServiceNow-AI/agent-tool-use|agent|5000|2
+ ai-blueprint/agent-research|agent|10000|2
+ THUDM/CogACT|agent|50000|2
+
+ # ─── DOCS / KNOWLEDGE (cybersec / cloud / SDLC) ──────────────────────────────
+ HuggingFaceFW/fineweb-edu|edu|130000000|1
+ HuggingFaceFW/fineweb|web|150000000|2
+ HuggingFaceFW/finepdfs|pdf|10000000|2
+ HuggingFaceTB/cosmopedia-v2|edu|40000000|2
+ HuggingFaceTB/cosmopedia|edu|31000000|2
+ allenai/dolma|web|50000000|2
+ allenai/peS2o|paper|40000000|2
+ allenai/c4|web|10000000|2
+ mlfoundations/dclm-baseline-1.0|web|10000000|2
+ EleutherAI/proof-pile-2|math|30000000|2
+
+ # ─── BUSINESS / GTM / FINANCE / LEGAL ─────────────────────────────────────────
+ PatronusAI/financebench|finance|150|1
+ AdaptLLM/finance-tasks|finance|50000|2
+ zeroshot/twitter-financial-news-sentiment|finance|10000|3
+ bitext/Bitext-customer-support-llm-chatbot-training-dataset|cs|26000|1
+ nlpaueb/legal_qa|legal|10000|2
+ casehold/casehold|legal|50000|2
+ PatronusAI/legalbench|legal|10000|2
+
+ # ─── MULTILINGUAL ─────────────────────────────────────────────────────────────
+ CohereForAI/aya_dataset|multi|200000|2
+ CohereForAI/aya_collection|multi|200000|2
+ HuggingFaceH4/Multilingual-Thinking|reasoning|150000|2
+
+ # ─── REASONING DISTILLATIONS (R1/o1-class) ────────────────────────────────────
+ GAIR/LIMO|reasoning|820|1
+ simplescaling/s1K|reasoning|1000|1
+ simplescaling/s1K-1.1|reasoning|1000|1
+ ServiceNow-AI/R1-Distill-SFT|reasoning|17000|1
+ agentica-org/DeepScaleR-Preview-Dataset|reasoning|40000|1
+ agentica-org/DeepCoder-Preview-Dataset|reasoning|24000|1
+ KodCode/KodCode-V1|code|447000|1
+
+ # ─── MEGA-MIX FOR LATER ───────────────────────────────────────────────────────
+ allenai/tulu-3-sft-personas-instruction-following|sft|30000|2
+ NeelNanda/pile-10k|web|10000|3
bin/v2/bulk-mirror-coordinator.py ADDED
@@ -0,0 +1,149 @@
+ """Surrogate-1 v2 — Bulk mirror coordinator with claim queue.
+
+ User feedback 2026-04-29:
+ "All agents should work together and not revisit the same places; search by keyword, then go to new ones."
+
+ This script is the work-claim broker: any number of mirror/discoverer/enricher
+ agents can pull tasks from here. Each task = (dataset_repo, expected_size,
+ priority). Claims persist in a dedicated SQLite database, alongside the
+ central DedupStore used for content dedup. Each claim carries a 15-minute
+ lease; claims from crashed workers auto-expire so other workers pick them up.
+
+ Usage from agents:
+     python3 bulk-mirror-coordinator.py claim        # → prints next task
+     python3 bulk-mirror-coordinator.py done <id>    # mark done
+     python3 bulk-mirror-coordinator.py status       # show queue + claimed
+     python3 bulk-mirror-coordinator.py seed         # one-time seed from massive list
+ """
+ import os, sys, sqlite3, time, json
+ from pathlib import Path
+
+ DB_PATH = Path.home() / ".surrogate/state/bulk-mirror-claims.db"
+ DB_PATH.parent.mkdir(parents=True, exist_ok=True)
+ LIST_PATH = Path.home() / ".surrogate/bin/v2/bulk-datasets-massive.txt"
+ LEASE_SECS = 15 * 60  # claim expires after 15 min if no `done` call
+
+
+ def db():
+     c = sqlite3.connect(str(DB_PATH), isolation_level=None, timeout=10)
+     c.execute("PRAGMA journal_mode=WAL")
+     c.execute("""CREATE TABLE IF NOT EXISTS claims (
+         id INTEGER PRIMARY KEY AUTOINCREMENT,
+         repo_id TEXT UNIQUE,
+         category TEXT,
+         max_samples INTEGER,
+         priority INTEGER,
+         status TEXT DEFAULT 'pending',  -- pending / claimed / done / failed
+         worker_id TEXT,
+         claimed_at INTEGER,
+         completed_at INTEGER,
+         kept_count INTEGER DEFAULT 0,
+         error TEXT
+     )""")
+     c.execute("CREATE INDEX IF NOT EXISTS idx_claims_status_pri ON claims(status, priority)")
+     return c
+
+
+ def seed():
+     """One-time: load the massive list into the queue."""
+     if not LIST_PATH.exists():
+         print(f"❌ {LIST_PATH} missing")
+         return
+     c = db()
+     n = 0
+     with open(LIST_PATH) as f:
+         for line in f:
+             line = line.strip()
+             if not line or line.startswith("#"):
+                 continue
+             try:
+                 repo, cat, mx, pri = line.split("|")
+                 before = c.total_changes  # total_changes is cumulative; diff it per insert
+                 c.execute("""INSERT OR IGNORE INTO claims
+                              (repo_id, category, max_samples, priority)
+                              VALUES (?, ?, ?, ?)""",
+                           (repo.strip(), cat.strip(), int(mx), int(pri)))
+                 if c.total_changes > before:
+                     n += 1
+             except Exception as e:
+                 print(f"  skip {line[:60]}: {e}")
+     c.close()
+     print(f"✅ seeded {n} new entries (existing rows untouched)")
+
+
+ def claim(worker_id: str | None = None):
+     """Atomically claim the next pending task by priority."""
+     worker_id = worker_id or f"w-{os.getpid()}-{int(time.time())}"
+     c = db()
+     now = int(time.time())
+     # Expire stale claims first
+     c.execute("""UPDATE claims SET status='pending', worker_id=NULL
+                  WHERE status='claimed' AND claimed_at < ?""",
+               (now - LEASE_SECS,))
+     # Claim next pending in priority order (UPDATE...RETURNING needs SQLite >= 3.35)
+     cur = c.execute("""UPDATE claims
+                        SET status='claimed', worker_id=?, claimed_at=?
+                        WHERE id = (
+                            SELECT id FROM claims
+                            WHERE status='pending'
+                            ORDER BY priority ASC, RANDOM()
+                            LIMIT 1
+                        )
+                        RETURNING id, repo_id, category, max_samples, priority""",
+                     (worker_id, now))
+     row = cur.fetchone()  # fetch from the cursor; Connection has no fetchone()
+     c.close()
+     if row:
+         cid, repo, cat, mx, pri = row
+         print(json.dumps({"id": cid, "repo_id": repo, "category": cat,
+                           "max_samples": mx, "priority": pri,
+                           "worker_id": worker_id}))
+     else:
+         print(json.dumps({"id": None, "msg": "no pending tasks"}))
+
+
+ def done(claim_id: int, kept: int = 0, error: str | None = None):
+     c = db()
+     status = "failed" if error else "done"
+     c.execute("""UPDATE claims SET status=?, completed_at=?, kept_count=?, error=?
+                  WHERE id=?""",
+               (status, int(time.time()), kept, error, claim_id))
+     c.close()
+     print(json.dumps({"id": claim_id, "status": status, "kept": kept}))
+
+
+ def status():
+     c = db()
+     cur = c.execute("""SELECT status, COUNT(*), SUM(kept_count)
+                        FROM claims GROUP BY status""")
+     print(f"{'status':<12} {'count':>6} {'kept_sum':>12}")
+     for s, n, k in cur:
+         print(f"{s:<12} {n:>6} {k or 0:>12}")
+     print()
+     cur = c.execute("""SELECT repo_id, status, kept_count, worker_id
+                        FROM claims
+                        WHERE status='claimed' OR status='failed'
+                        ORDER BY claimed_at DESC LIMIT 20""")
+     print(f"{'repo':<55} {'status':<10} {'kept':>8} {'worker':<20}")
+     for repo, s, k, w in cur:
+         print(f"{repo[:55]:<55} {s:<10} {k or 0:>8} {w or '-':<20}")
+     c.close()
+
+
+ if __name__ == "__main__":
+     cmd = sys.argv[1] if len(sys.argv) > 1 else "status"
+     if cmd == "seed":
+         seed()
+     elif cmd == "claim":
+         worker_id = sys.argv[2] if len(sys.argv) > 2 else None
+         claim(worker_id)
+     elif cmd == "done":
+         cid = int(sys.argv[2])
+         kept = int(sys.argv[3]) if len(sys.argv) > 3 else 0
+         err = (sys.argv[4] or None) if len(sys.argv) > 4 else None  # "" means no error
+         done(cid, kept, err)
+     elif cmd == "status":
+         status()
+     else:
+         print(f"unknown: {cmd}", file=sys.stderr)
+         sys.exit(1)
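
For illustration, a Python-side agent could consume the same claim queue via
the CLI documented in the docstring above (the worker id "enricher-1" and the
kept count are made up; bulk-mirror-worker.sh below does the equivalent in bash):

    # Hypothetical consumer of the coordinator CLI.
    import json, subprocess
    from pathlib import Path

    COORD = str(Path.home() / ".surrogate/bin/v2/bulk-mirror-coordinator.py")

    task = json.loads(subprocess.run(
        ["python3", COORD, "claim", "enricher-1"],
        capture_output=True, text=True, check=True).stdout)

    if task["id"] is not None:
        print(f"claimed #{task['id']}: {task['repo_id']} (max {task['max_samples']})")
        # ... mirror / enrich within the 15-minute lease ...
        kept = 0  # illustrative count of kept samples
        subprocess.run(["python3", COORD, "done", str(task["id"]), str(kept)], check=True)
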
bin/v2/bulk-mirror-worker.sh ADDED
@@ -0,0 +1,102 @@
+ #!/usr/bin/env bash
+ # Bulk mirror worker — claims a dataset from the coordinator, mirrors + sanitizes + dedups it.
+ # Spawn N of these on the HF Space; each runs its own loop, with no duplicated work.
+ #
+ # Usage: bash bulk-mirror-worker.sh [worker_id]
+
+ set -uo pipefail
+ set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a
+ WORKER_ID="${1:-w-$$-$(date +%s)}"
+ LOG="$HOME/.surrogate/logs/bulk-worker-${WORKER_ID}.log"
+ mkdir -p "$(dirname "$LOG")"
+
+ echo "[$(date +%H:%M:%S)] worker $WORKER_ID start" | tee -a "$LOG"
+
+ # Loop forever, claiming + processing
+ while true; do
+   # Claim next task
+   TASK=$(python3 "$HOME/.surrogate/bin/v2/bulk-mirror-coordinator.py" claim "$WORKER_ID")
+   REPO=$(echo "$TASK" | python3 -c "import sys, json; print(json.load(sys.stdin).get('repo_id') or '')")
+   CID=$(echo "$TASK" | python3 -c "import sys, json; print(json.load(sys.stdin).get('id') or '')")
+   MAX_N=$(echo "$TASK" | python3 -c "import sys, json; print(json.load(sys.stdin).get('max_samples') or 50000)")
+
+   if [[ -z "$REPO" || "$REPO" == "None" ]]; then
+     echo "[$(date +%H:%M:%S)] no work; sleep 5 min" | tee -a "$LOG"
+     sleep 300
+     continue
+   fi
+
+   echo "[$(date +%H:%M:%S)] $WORKER_ID claimed #$CID $REPO (max=$MAX_N)" | tee -a "$LOG"
+
+   # Run mirror (sanitizer + DedupStore). Stderr goes to the log; stdout is
+   # captured in $OUT so the KEPT=N marker can be parsed (it never reached
+   # the log file, so grepping the log always came up empty).
+   ERROR=""
+   OUT=$(HF_TOKEN="${HF_TOKEN:-}" python3 - "$REPO" "$MAX_N" <<'PYEOF' 2>>"$LOG"
+ import sys, json
+ from pathlib import Path
+ sys.path.insert(0, str(Path.home() / ".surrogate/bin/lib"))
+ from sanitize import filter_pair
+ try:
+     from dedup import DedupStore
+     HAS_DEDUP = True
+ except Exception:
+     HAS_DEDUP = False
+
+ repo, max_n = sys.argv[1], int(sys.argv[2])
+ from datasets import load_dataset
+ try:
+     ds = load_dataset(repo, split="train", streaming=True)
+ except Exception as e:
+     print(f"load fail: {e}", file=sys.stderr)
+     print("KEPT=0")
+     sys.exit(1)  # non-zero exit: task is marked failed, not silently done
+
+ out_path = Path.home() / f".surrogate/data/bulk-mirror/{repo.replace('/', '_')}.jsonl"
+ out_path.parent.mkdir(parents=True, exist_ok=True)
+ kept = 0
+ with open(out_path, "w") as f:
+     for ex in ds:
+         if kept >= max_n:
+             break
+         # Robust extraction across common schema variants
+         p = (ex.get("prompt") or ex.get("instruction") or ex.get("question")
+              or ex.get("input") or ex.get("query") or ex.get("text") or "")
+         r = (ex.get("response") or ex.get("answer") or ex.get("output")
+              or ex.get("completion") or ex.get("chosen") or "")
+         if (not p or not r) and isinstance(ex.get("messages"), list) and len(ex["messages"]) >= 2:
+             msgs = ex["messages"]
+             u = next((m.get("content", "") or m.get("value", "") for m in msgs
+                       if m.get("role") in ("user", "human") or m.get("from") in ("user", "human")), "")
+             a = next((m.get("content", "") or m.get("value", "") for m in msgs
+                       if m.get("role") in ("assistant", "gpt") or m.get("from") in ("assistant", "gpt")), "")
+             if u and a:
+                 p, r = u, a
+         if not p or not r:
+             continue
+         p, r = str(p)[:6000].strip(), str(r)[:8000].strip()
+         if len(p) < 20 or len(r) < 30:
+             continue
+         v = filter_pair(p, r)
+         if not v["keep"]:
+             continue
+         if HAS_DEDUP and not DedupStore.is_new(p, source=f"bulk-{repo}"):
+             continue
+         f.write(json.dumps({"prompt": p, "response": r, "source": repo}, ensure_ascii=False) + "\n")
+         kept += 1
+
+ print(f"KEPT={kept}")
+ PYEOF
+   ) || ERROR="failed"
+
+   # Parse KEPT from the captured stdout
+   KEPT=$(grep -oE "KEPT=[0-9]+" <<<"$OUT" | tail -1 | cut -d= -f2)
+   KEPT=${KEPT:-0}
+
+   # Mark done in coordinator
+   python3 "$HOME/.surrogate/bin/v2/bulk-mirror-coordinator.py" done "$CID" "$KEPT" "$ERROR" >> "$LOG"
+   echo "[$(date +%H:%M:%S)] $WORKER_ID done #$CID kept=$KEPT" | tee -a "$LOG"
+
+   # Brief pause to be gentle on the HF API
+   sleep 10
+ done
start.sh CHANGED
@@ -325,6 +325,19 @@ echo "[$(date +%H:%M:%S)] parquet-direct-ingest started (2 parallel DLs)" >> "$L
 nohup bash ~/.surrogate/bin/skill-synthesis-daemon.sh > "$LOG_DIR/skill-synthesis.log" 2>&1 &
 echo "[$(date +%H:%M:%S)] skill-synthesis daemon started" >> "$LOG_DIR/boot.log"
 
+ # ── 7d. Bulk mirror coordinator + 4 parallel workers ────────────────────────
+ # User feedback 2026-04-29: "All agents work together and don't revisit the same places."
+ # Coordinator = SQLite claim queue (~/.surrogate/state/bulk-mirror-claims.db).
+ # Workers each pull the next pending dataset, mirror+sanitize+dedup, mark done.
+ # 100+ massive datasets in bin/v2/bulk-datasets-massive.txt (code/security/SDLC/agent/etc).
+ # Lease-based claims (15 min) — crashes auto-expire so other workers pick up.
+ python3 ~/.surrogate/bin/v2/bulk-mirror-coordinator.py seed >> "$LOG_DIR/bulk-mirror-seed.log" 2>&1 || true
+ for i in 1 2 3 4; do
+   nohup bash ~/.surrogate/bin/v2/bulk-mirror-worker.sh "bulk-w$i" \
+     > "$LOG_DIR/bulk-worker-$i.log" 2>&1 &
+ done
+ echo "[$(date +%H:%M:%S)] bulk-mirror coordinator + 4 workers started (100+ datasets queued)" >> "$LOG_DIR/boot.log"
+
 # ── 7d. Train-ready pusher — disabled at boot for now. Caused Space
 # RUNTIME_ERROR on first deployment (2026-04-29). Script kept at
 # bin/train-ready-pusher.sh; launch manually after Space proves stable: