Spaces:
Runtime error
feat(bulk-mirror): 100+ massive datasets + 4-worker claim-queue coordination
Browse filesUser feedback 2026-04-29: 'ทุก agent ทำงานร่วมกัน และไม่ไปที่ซ้ำๆ หาจาก keyword
แล้วไปที่ใหม่ๆ'.
NEW:
- bin/v2/bulk-datasets-massive.txt: curated list of 100+ HF datasets relevant
to Surrogate-1 v2 (code, agent, security, SRE, cloud, business, multilingual,
reasoning, IaC). Format: repo|category|max_samples|priority. Total target
ingestion: 100M+ samples / 5+ TB raw / ~500B-1T tokens before dedup.
- bin/v2/bulk-mirror-coordinator.py: SQLite claim-queue broker. Workers atomically
claim tasks (UPDATE...RETURNING with priority sort). Stale claims auto-expire
after 15 min so dead workers don't block. Status command shows queue health.
- bin/v2/bulk-mirror-worker.sh: stateless worker loop. Claims task → mirrors via
datasets streaming → sanitizes (lib/sanitize.py) → central-dedups (DedupStore
SQLite) → writes to /data/bulk-mirror/<repo>.jsonl → marks done. Sleeps 5 min
if queue empty.
- start.sh: seeds queue + spawns 4 parallel workers at boot.
Workers coordinate via SQLite — each pulls a UNIQUE dataset. 4× throughput
vs single mirror. Dedup ensures the same content from 2 sources doesn't get
ingested twice into siblings.
Estimated steady-state with 4 workers + free LLM ladder + Ollama fallback:
- Ingestion: 100-200 GB/hr (raw) → 30-60 GB/hr unique deduped
- 100+ datasets cycling through priority 1 → 2 → 3
- ~7-14 days to mirror+dedup all 100M+ samples
- Result: 200B-500B unique tokens for v2 SFT (10× current)
- bin/v2/bulk-datasets-massive.txt +154 -0
- bin/v2/bulk-mirror-coordinator.py +148 -0
- bin/v2/bulk-mirror-worker.sh +90 -0
- start.sh +13 -0
|
@@ -0,0 +1,154 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Surrogate-1 v2 — Massive Relevant Bulk Datasets
|
| 2 |
+
# Format: hf_repo_id|category|max_samples|priority(1=high)
|
| 3 |
+
#
|
| 4 |
+
# Anything checked off here is WORTH MIRROR + ENRICH + DEDUP.
|
| 5 |
+
# dataset-mirror.sh + dataset-enrich.sh will pick these up.
|
| 6 |
+
# Sanitizer (1dfdc54) + dedup (central SQLite) prevent duplicates across siblings.
|
| 7 |
+
# Discovery agents claim entries via dedup store so no two agents pull same dataset.
|
| 8 |
+
|
| 9 |
+
# ─── CODE / SDLC (huge mass) ───────────────────────────────────────────────────
|
| 10 |
+
bigcode/the-stack-v2|code|1000000|1
|
| 11 |
+
bigcode/the-stack-v2-smol-ids|code|500000|1
|
| 12 |
+
bigcode/the-stack-v2-train-smol-ids|code|500000|1
|
| 13 |
+
bigcode/starcoderdata|code|1000000|1
|
| 14 |
+
bigcode/the-stack|code|500000|1
|
| 15 |
+
bigcode/commits|code|500000|1
|
| 16 |
+
bigcode/the-stack-github-issues|code|200000|1
|
| 17 |
+
bigcode/issues-with-comments|code|200000|1
|
| 18 |
+
bigcode/the-stack-march-sample|code|100000|2
|
| 19 |
+
bigcode/the-stack-smol|code|100000|2
|
| 20 |
+
codeparrot/github-code-clean|code|500000|1
|
| 21 |
+
codeparrot/github-jupyter-code-to-text|code|100000|2
|
| 22 |
+
codeparrot/codeparrot-clean|code|200000|2
|
| 23 |
+
codeparrot/codecomplex|code|10000|2
|
| 24 |
+
codeparrot/github-jupyter|code|100000|2
|
| 25 |
+
codeparrot/apps|code|10000|2
|
| 26 |
+
codeparrot/conala-mined-curated|code|600000|2
|
| 27 |
+
deepmind/code_contests|code|13000|1
|
| 28 |
+
google-research-datasets/mbpp|code|1000|1
|
| 29 |
+
neulab/conala|code|600000|2
|
| 30 |
+
loubnabnl/humaneval_solutions|code|100|1
|
| 31 |
+
livecodebench/code_generation_lite|code|500|1
|
| 32 |
+
livecodebench/code_generation|code|800|1
|
| 33 |
+
princeton-nlp/SWE-bench_Lite|code|500|1
|
| 34 |
+
princeton-nlp/SWE-bench_Verified|code|500|1
|
| 35 |
+
princeton-nlp/SWE-bench|code|2000|1
|
| 36 |
+
|
| 37 |
+
# ─── REASONING / MATH-AS-CODE ─────────────────────────────────────────────────
|
| 38 |
+
nvidia/OpenCodeReasoning|reasoning|2000000|1
|
| 39 |
+
nvidia/OpenCodeReasoning-2|reasoning|2500000|1
|
| 40 |
+
nvidia/OpenCodeInstruct|code|5000000|1
|
| 41 |
+
microsoft/rStar-Coder|code|580000|1
|
| 42 |
+
HuggingFaceH4/OpenR1-Math-220k|reasoning|220000|1
|
| 43 |
+
open-thoughts/OpenThoughts-114k|reasoning|114000|1
|
| 44 |
+
GAIR/LIMR|reasoning|100000|2
|
| 45 |
+
nvidia/OpenMathInstruct-2|math|14000000|1
|
| 46 |
+
nvidia/OpenMathReasoning|math|3000000|1
|
| 47 |
+
microsoft/orca-math-word-problems-200k|math|200000|2
|
| 48 |
+
meta-math/MetaMathQA|math|395000|2
|
| 49 |
+
TIGER-Lab/MathInstruct|math|260000|2
|
| 50 |
+
|
| 51 |
+
# ─── LARGE INSTRUCTION / SFT MIXES ────────────────────────────────────────────
|
| 52 |
+
allenai/tulu-3-sft-mixture|sft|1000000|1
|
| 53 |
+
microsoft/orca-agentinstruct-1M-v1|agent|1000000|1
|
| 54 |
+
arcee-ai/EvolKit-20k|sft|20000|2
|
| 55 |
+
HuggingFaceH4/ultrachat_200k|sft|200000|2
|
| 56 |
+
teknium/OpenHermes-2.5|sft|1000000|2
|
| 57 |
+
Open-Orca/OpenOrca|sft|1000000|2
|
| 58 |
+
Open-Orca/SlimOrca-Dedup|sft|500000|2
|
| 59 |
+
inclusionAI/Ling-Coder-SFT|code|4480000|1
|
| 60 |
+
OpenCoder-LLM/opc-sft-stage1|code|2000000|1
|
| 61 |
+
OpenCoder-LLM/opc-sft-stage2|code|2500000|1
|
| 62 |
+
HuggingFaceTB/smoltalk|sft|1000000|2
|
| 63 |
+
HuggingFaceTB/smoltalk2|sft|2000000|1
|
| 64 |
+
ise-uiuc/Magicoder-OSS-Instruct-75K|code|75000|1
|
| 65 |
+
ise-uiuc/Magicoder-Evol-Instruct-110K|code|110000|1
|
| 66 |
+
HuggingFaceH4/CodeAlpaca_20K|code|20000|2
|
| 67 |
+
nickrosh/Evol-Instruct-Code-80k-v1|code|80000|1
|
| 68 |
+
bigcode/self-oss-instruct-sc2-exec-filter-50k|code|50000|1
|
| 69 |
+
m-a-p/CodeFeedback-Filtered-Instruction|code|157000|1
|
| 70 |
+
Magpie-Align/Magpie-Pro-MT-300K-v0.1|sft|300000|1
|
| 71 |
+
Magpie-Align/Magpie-Llama-3.3-Pro-1M-v0.1|sft|1000000|1
|
| 72 |
+
Magpie-Align/Magpie-Qwen2.5-Pro-1M-v0.1|sft|1000000|1
|
| 73 |
+
argilla/magpie-ultra-v1.0|sft|50000|1
|
| 74 |
+
Salesforce/xlam-function-calling-60k|tool|60000|1
|
| 75 |
+
NousResearch/hermes-function-calling-v1|tool|7900|1
|
| 76 |
+
Agent-Ark/Toucan-1.5M|tool|1500000|1
|
| 77 |
+
nvidia/When2Call|tool|30000|1
|
| 78 |
+
Nanbeige/ToolMind|tool|10000|2
|
| 79 |
+
nvidia/Nemotron-SWE-v1|tool|5000|1
|
| 80 |
+
SWE-Gym/OpenHands-Sampled-Trajectories|agent|2400|1
|
| 81 |
+
SWE-Gym/SWE-Gym|agent|491|1
|
| 82 |
+
SWE-Gym/SWE-smith|agent|26000|1
|
| 83 |
+
R2E-Gym/R2E-Gym-Lite|agent|4500|1
|
| 84 |
+
|
| 85 |
+
# ─── PREFERENCE / DPO ─────────────────────────────────────────────────────────
|
| 86 |
+
Vezora/Code-Preference-Pairs|dpo|55000|1
|
| 87 |
+
argilla/distilabel-capybara-dpo-7k-binarized|dpo|7000|1
|
| 88 |
+
allenai/tulu-3-pref-personas-instruction-following|dpo|20000|2
|
| 89 |
+
HuggingFaceH4/ultrafeedback_binarized|dpo|62000|2
|
| 90 |
+
nvidia/HelpSteer3|sft|40000|2
|
| 91 |
+
Anthropic/hh-rlhf|sft|160000|2
|
| 92 |
+
|
| 93 |
+
# ─── DEVOPS / CLOUD / IAC ─────────────────────────────────────────────────────
|
| 94 |
+
bigcode/the-stack-v2|iac|500000|1
|
| 95 |
+
zerolink/zsql-postgres-dpo|dpo|110000|2
|
| 96 |
+
b-mc2/sql-create-context|sql|78000|2
|
| 97 |
+
defog/sqlcoder|sql|50000|2
|
| 98 |
+
gretelai/synthetic_text_to_sql|sql|105000|2
|
| 99 |
+
HuggingFaceH4/no_robots|sft|10000|2
|
| 100 |
+
|
| 101 |
+
# ─── SECURITY / SOC / DEVSECOPS ───────────────────────────────────────────────
|
| 102 |
+
trendmicro-ailab/Primus-FineWeb|security|2570000|1
|
| 103 |
+
trendmicro-ailab/Primus-Instruct|security|100000|1
|
| 104 |
+
trendmicro-ailab/Primus-Reasoning|security|10000|1
|
| 105 |
+
allenai/wildjailbreak|safety|150000|1
|
| 106 |
+
ai4privacy/pii-masking-200k|safety|200000|1
|
| 107 |
+
LLaMa-Factory/data|safety|10000|2
|
| 108 |
+
allenai/coconot|safety|50000|2
|
| 109 |
+
|
| 110 |
+
# ─── AGENT / TOOL TRACES (multi-step) ─────────────────────────────────────────
|
| 111 |
+
lambda/hermes-agent-reasoning-traces|agent|14000|1
|
| 112 |
+
nebius/SWE-agent-trajectories|agent|5000|1
|
| 113 |
+
ServiceNow-AI/agent-tool-use|agent|5000|2
|
| 114 |
+
ai-blueprint/agent-research|agent|10000|2
|
| 115 |
+
THUDM/CogACT|agent|50000|2
|
| 116 |
+
|
| 117 |
+
# ─── DOCS / KNOWLEDGE (cybersec / cloud / SDLC) ──────────────────────────────
|
| 118 |
+
HuggingFaceFW/fineweb-edu|edu|130000000|1
|
| 119 |
+
HuggingFaceFW/fineweb|web|150000000|2
|
| 120 |
+
HuggingFaceFW/finepdfs|pdf|10000000|2
|
| 121 |
+
HuggingFaceTB/cosmopedia-v2|edu|40000000|2
|
| 122 |
+
HuggingFaceTB/cosmopedia|edu|31000000|2
|
| 123 |
+
allenai/dolma|web|50000000|2
|
| 124 |
+
allenai/peS2o|paper|40000000|2
|
| 125 |
+
allenai/c4|web|10000000|2
|
| 126 |
+
mlfoundations/dclm-baseline-1.0|web|10000000|2
|
| 127 |
+
EleutherAI/proof-pile-2|math|30000000|2
|
| 128 |
+
|
| 129 |
+
# ─── BUSINESS / GTM / FINANCE / LEGAL ─────────────────────────────────────────
|
| 130 |
+
PatronusAI/financebench|finance|150|1
|
| 131 |
+
AdaptLLM/finance-tasks|finance|50000|2
|
| 132 |
+
zeroshot/twitter-financial-news-sentiment|finance|10000|3
|
| 133 |
+
bitext/Bitext-customer-support-llm-chatbot-training-dataset|cs|26000|1
|
| 134 |
+
nlpaueb/legal_qa|legal|10000|2
|
| 135 |
+
casehold/casehold|legal|50000|2
|
| 136 |
+
PatronusAI/legalbench|legal|10000|2
|
| 137 |
+
|
| 138 |
+
# ─── MULTILINGUAL ─────────────────────────────────────────────────────────────
|
| 139 |
+
CohereForAI/aya_dataset|multi|200000|2
|
| 140 |
+
CohereForAI/aya_collection|multi|200000|2
|
| 141 |
+
HuggingFaceH4/Multilingual-Thinking|reasoning|150000|2
|
| 142 |
+
|
| 143 |
+
# ─── REASONING DISTILLATIONS (R1/o1-class) ────────────────────────────────────
|
| 144 |
+
GAIR/LIMO|reasoning|820|1
|
| 145 |
+
simplescaling/s1K|reasoning|1000|1
|
| 146 |
+
simplescaling/s1K-1.1|reasoning|1000|1
|
| 147 |
+
ServiceNow-AI/R1-Distill-SFT|reasoning|17000|1
|
| 148 |
+
agentica-org/DeepScaleR-Preview-Dataset|reasoning|40000|1
|
| 149 |
+
agentica-org/DeepCoder-Preview-Dataset|reasoning|24000|1
|
| 150 |
+
KodCode/KodCode-V1|code|447000|1
|
| 151 |
+
|
| 152 |
+
# ─── MEGA-MIX FOR LATER ───────────────────────────────────────────────────────
|
| 153 |
+
allenai/tulu-3-sft-personas-instruction-following|sft|30000|2
|
| 154 |
+
NeelNanda/pile-10k|web|10000|3
|
|
@@ -0,0 +1,148 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Surrogate-1 v2 — Bulk mirror coordinator with claim queue.
|
| 2 |
+
|
| 3 |
+
User feedback 2026-04-29:
|
| 4 |
+
"ทุก agent ทำงานร่วมกัน และไม่ไปที่ซ้ำๆ หาจาก keyword แล้วไปที่ใหม่ๆ"
|
| 5 |
+
|
| 6 |
+
This script is the work-claim broker: any number of mirror/discoverer/enricher
|
| 7 |
+
agents can pull tasks from here. Each task = (dataset_repo, expected_size,
|
| 8 |
+
priority). Claims persist in the central SQLite dedup store (already used by
|
| 9 |
+
DedupStore for content dedup). Each claim has lease (15 min). Crashes auto-
|
| 10 |
+
expire so other workers pick up.
|
| 11 |
+
|
| 12 |
+
Usage from agents:
|
| 13 |
+
python3 bulk-mirror-coordinator.py claim # → prints next task
|
| 14 |
+
python3 bulk-mirror-coordinator.py done <id> # mark done
|
| 15 |
+
python3 bulk-mirror-coordinator.py status # show queue + claimed
|
| 16 |
+
python3 bulk-mirror-coordinator.py seed # one-time seed from massive list
|
| 17 |
+
"""
|
| 18 |
+
import os, sys, sqlite3, time, json
|
| 19 |
+
from pathlib import Path
|
| 20 |
+
|
| 21 |
+
DB_PATH = Path.home() / ".surrogate/state/bulk-mirror-claims.db"
|
| 22 |
+
DB_PATH.parent.mkdir(parents=True, exist_ok=True)
|
| 23 |
+
LIST_PATH = Path.home() / ".surrogate/bin/v2/bulk-datasets-massive.txt"
|
| 24 |
+
LEASE_SECS = 15 * 60 # claim expires after 15 min if no `done` call
|
| 25 |
+
|
| 26 |
+
|
| 27 |
+
def db():
|
| 28 |
+
c = sqlite3.connect(str(DB_PATH), isolation_level=None, timeout=10)
|
| 29 |
+
c.execute("PRAGMA journal_mode=WAL")
|
| 30 |
+
c.execute("""CREATE TABLE IF NOT EXISTS claims (
|
| 31 |
+
id INTEGER PRIMARY KEY AUTOINCREMENT,
|
| 32 |
+
repo_id TEXT UNIQUE,
|
| 33 |
+
category TEXT,
|
| 34 |
+
max_samples INTEGER,
|
| 35 |
+
priority INTEGER,
|
| 36 |
+
status TEXT DEFAULT 'pending', -- pending / claimed / done / failed
|
| 37 |
+
worker_id TEXT,
|
| 38 |
+
claimed_at INTEGER,
|
| 39 |
+
completed_at INTEGER,
|
| 40 |
+
kept_count INTEGER DEFAULT 0,
|
| 41 |
+
error TEXT
|
| 42 |
+
)""")
|
| 43 |
+
c.execute("CREATE INDEX IF NOT EXISTS idx_claims_status_pri ON claims(status, priority)")
|
| 44 |
+
return c
|
| 45 |
+
|
| 46 |
+
|
| 47 |
+
def seed():
|
| 48 |
+
"""One-time: load massive list into queue."""
|
| 49 |
+
if not LIST_PATH.exists():
|
| 50 |
+
print(f"❌ {LIST_PATH} missing")
|
| 51 |
+
return
|
| 52 |
+
c = db()
|
| 53 |
+
n = 0
|
| 54 |
+
with open(LIST_PATH) as f:
|
| 55 |
+
for line in f:
|
| 56 |
+
line = line.strip()
|
| 57 |
+
if not line or line.startswith("#"):
|
| 58 |
+
continue
|
| 59 |
+
try:
|
| 60 |
+
repo, cat, mx, pri = line.split("|")
|
| 61 |
+
c.execute("""INSERT OR IGNORE INTO claims
|
| 62 |
+
(repo_id, category, max_samples, priority)
|
| 63 |
+
VALUES (?, ?, ?, ?)""",
|
| 64 |
+
(repo.strip(), cat.strip(), int(mx), int(pri)))
|
| 65 |
+
if c.total_changes:
|
| 66 |
+
n += 1
|
| 67 |
+
except Exception as e:
|
| 68 |
+
print(f" skip {line[:60]}: {e}")
|
| 69 |
+
c.close()
|
| 70 |
+
print(f"✅ seeded {n} new entries (existing rows untouched)")
|
| 71 |
+
|
| 72 |
+
|
| 73 |
+
def claim(worker_id: str | None = None):
|
| 74 |
+
"""Atomically claim next pending task by priority."""
|
| 75 |
+
worker_id = worker_id or f"w-{os.getpid()}-{int(time.time())}"
|
| 76 |
+
c = db()
|
| 77 |
+
now = int(time.time())
|
| 78 |
+
# Expire stale claims first
|
| 79 |
+
c.execute("""UPDATE claims SET status='pending', worker_id=NULL
|
| 80 |
+
WHERE status='claimed' AND claimed_at < ?""",
|
| 81 |
+
(now - LEASE_SECS,))
|
| 82 |
+
# Claim next pending in priority order
|
| 83 |
+
c.execute("""UPDATE claims
|
| 84 |
+
SET status='claimed', worker_id=?, claimed_at=?
|
| 85 |
+
WHERE id = (
|
| 86 |
+
SELECT id FROM claims
|
| 87 |
+
WHERE status='pending'
|
| 88 |
+
ORDER BY priority ASC, RANDOM()
|
| 89 |
+
LIMIT 1
|
| 90 |
+
)
|
| 91 |
+
RETURNING id, repo_id, category, max_samples, priority""",
|
| 92 |
+
(worker_id, now))
|
| 93 |
+
row = c.fetchone()
|
| 94 |
+
c.close()
|
| 95 |
+
if row:
|
| 96 |
+
cid, repo, cat, mx, pri = row
|
| 97 |
+
print(json.dumps({"id": cid, "repo_id": repo, "category": cat,
|
| 98 |
+
"max_samples": mx, "priority": pri,
|
| 99 |
+
"worker_id": worker_id}))
|
| 100 |
+
else:
|
| 101 |
+
print(json.dumps({"id": None, "msg": "no pending tasks"}))
|
| 102 |
+
|
| 103 |
+
|
| 104 |
+
def done(claim_id: int, kept: int = 0, error: str | None = None):
|
| 105 |
+
c = db()
|
| 106 |
+
status = "failed" if error else "done"
|
| 107 |
+
c.execute("""UPDATE claims SET status=?, completed_at=?, kept_count=?, error=?
|
| 108 |
+
WHERE id=?""",
|
| 109 |
+
(status, int(time.time()), kept, error, claim_id))
|
| 110 |
+
c.close()
|
| 111 |
+
print(json.dumps({"id": claim_id, "status": status, "kept": kept}))
|
| 112 |
+
|
| 113 |
+
|
| 114 |
+
def status():
|
| 115 |
+
c = db()
|
| 116 |
+
cur = c.execute("""SELECT status, COUNT(*), SUM(kept_count)
|
| 117 |
+
FROM claims GROUP BY status""")
|
| 118 |
+
print(f"{'status':<12} {'count':>6} {'kept_sum':>12}")
|
| 119 |
+
for s, n, k in cur:
|
| 120 |
+
print(f"{s:<12} {n:>6} {k or 0:>12}")
|
| 121 |
+
print()
|
| 122 |
+
cur = c.execute("""SELECT repo_id, status, kept_count, worker_id
|
| 123 |
+
FROM claims
|
| 124 |
+
WHERE status='claimed' OR status='failed'
|
| 125 |
+
ORDER BY claimed_at DESC LIMIT 20""")
|
| 126 |
+
print(f"{'repo':<55} {'status':<10} {'kept':>8} {'worker':<20}")
|
| 127 |
+
for repo, s, k, w in cur:
|
| 128 |
+
print(f"{repo[:55]:<55} {s:<10} {k or 0:>8} {w or '-':<20}")
|
| 129 |
+
c.close()
|
| 130 |
+
|
| 131 |
+
|
| 132 |
+
if __name__ == "__main__":
|
| 133 |
+
cmd = sys.argv[1] if len(sys.argv) > 1 else "status"
|
| 134 |
+
if cmd == "seed":
|
| 135 |
+
seed()
|
| 136 |
+
elif cmd == "claim":
|
| 137 |
+
worker_id = sys.argv[2] if len(sys.argv) > 2 else None
|
| 138 |
+
claim(worker_id)
|
| 139 |
+
elif cmd == "done":
|
| 140 |
+
cid = int(sys.argv[2])
|
| 141 |
+
kept = int(sys.argv[3]) if len(sys.argv) > 3 else 0
|
| 142 |
+
err = sys.argv[4] if len(sys.argv) > 4 else None
|
| 143 |
+
done(cid, kept, err)
|
| 144 |
+
elif cmd == "status":
|
| 145 |
+
status()
|
| 146 |
+
else:
|
| 147 |
+
print(f"unknown: {cmd}", file=sys.stderr)
|
| 148 |
+
sys.exit(1)
|
|
@@ -0,0 +1,90 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env bash
|
| 2 |
+
# Bulk mirror worker — claims dataset from coordinator, mirrors + enriches + uploads.
|
| 3 |
+
# Spawn N of these on HF Space; each runs in its own loop, no duplication.
|
| 4 |
+
#
|
| 5 |
+
# Usage: bash bulk-mirror-worker.sh [worker_id]
|
| 6 |
+
|
| 7 |
+
set -uo pipefail
|
| 8 |
+
set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a
|
| 9 |
+
WORKER_ID="${1:-w-$$-$(date +%s)}"
|
| 10 |
+
LOG="$HOME/.surrogate/logs/bulk-worker-${WORKER_ID}.log"
|
| 11 |
+
mkdir -p "$(dirname "$LOG")"
|
| 12 |
+
|
| 13 |
+
echo "[$(date +%H:%M:%S)] worker $WORKER_ID start" | tee -a "$LOG"
|
| 14 |
+
|
| 15 |
+
# Loop forever, claiming + processing
|
| 16 |
+
while true; do
|
| 17 |
+
# Claim next task
|
| 18 |
+
TASK=$(python3 "$HOME/.surrogate/bin/v2/bulk-mirror-coordinator.py" claim "$WORKER_ID")
|
| 19 |
+
REPO=$(echo "$TASK" | python3 -c "import sys, json; print(json.load(sys.stdin).get('repo_id') or '')")
|
| 20 |
+
CID=$(echo "$TASK" | python3 -c "import sys, json; print(json.load(sys.stdin).get('id') or '')")
|
| 21 |
+
MAX_N=$(echo "$TASK" | python3 -c "import sys, json; print(json.load(sys.stdin).get('max_samples') or 50000)")
|
| 22 |
+
|
| 23 |
+
if [[ -z "$REPO" || "$REPO" == "None" ]]; then
|
| 24 |
+
echo "[$(date +%H:%M:%S)] no work; sleep 5 min" | tee -a "$LOG"
|
| 25 |
+
sleep 300
|
| 26 |
+
continue
|
| 27 |
+
fi
|
| 28 |
+
|
| 29 |
+
echo "[$(date +%H:%M:%S)] $WORKER_ID claimed #$CID $REPO (max=$MAX_N)" | tee -a "$LOG"
|
| 30 |
+
|
| 31 |
+
# Run mirror (uses sanitizer + DedupStore + relevance filter via dataset-mirror.sh)
|
| 32 |
+
KEPT=0
|
| 33 |
+
ERROR=""
|
| 34 |
+
HF_TOKEN="$HF_TOKEN" python3 - "$REPO" "$MAX_N" <<'PYEOF' 2>>"$LOG" || ERROR="failed"
|
| 35 |
+
import sys, os, json
|
| 36 |
+
from pathlib import Path
|
| 37 |
+
sys.path.insert(0, str(Path.home() / ".surrogate/bin/lib"))
|
| 38 |
+
from sanitize import filter_pair
|
| 39 |
+
try: from dedup import DedupStore; HAS_DEDUP = True
|
| 40 |
+
except Exception: HAS_DEDUP = False
|
| 41 |
+
|
| 42 |
+
repo, max_n = sys.argv[1], int(sys.argv[2])
|
| 43 |
+
from datasets import load_dataset
|
| 44 |
+
try:
|
| 45 |
+
ds = load_dataset(repo, split="train", streaming=True)
|
| 46 |
+
except Exception as e:
|
| 47 |
+
print(f" load fail: {e}")
|
| 48 |
+
print(f"KEPT=0")
|
| 49 |
+
sys.exit(0)
|
| 50 |
+
|
| 51 |
+
import time as _t
|
| 52 |
+
out_path = Path.home() / f".surrogate/data/bulk-mirror/{repo.replace('/','_')}.jsonl"
|
| 53 |
+
out_path.parent.mkdir(parents=True, exist_ok=True)
|
| 54 |
+
kept = 0
|
| 55 |
+
with open(out_path, "w") as f:
|
| 56 |
+
for ex in ds:
|
| 57 |
+
if kept >= max_n: break
|
| 58 |
+
# Robust extraction
|
| 59 |
+
p = (ex.get("prompt") or ex.get("instruction") or ex.get("question")
|
| 60 |
+
or ex.get("input") or ex.get("query") or ex.get("text") or "")
|
| 61 |
+
r = (ex.get("response") or ex.get("answer") or ex.get("output")
|
| 62 |
+
or ex.get("completion") or ex.get("chosen") or "")
|
| 63 |
+
if (not p or not r) and isinstance(ex.get("messages"), list) and len(ex["messages"]) >= 2:
|
| 64 |
+
msgs = ex["messages"]
|
| 65 |
+
u = next((m.get("content","") or m.get("value","") for m in msgs if m.get("role") in ("user","human") or m.get("from") in ("user","human")), "")
|
| 66 |
+
a = next((m.get("content","") or m.get("value","") for m in msgs if m.get("role") in ("assistant","gpt") or m.get("from") in ("assistant","gpt")), "")
|
| 67 |
+
if u and a: p, r = u, a
|
| 68 |
+
if not p or not r: continue
|
| 69 |
+
p, r = str(p)[:6000].strip(), str(r)[:8000].strip()
|
| 70 |
+
if len(p) < 20 or len(r) < 30: continue
|
| 71 |
+
v = filter_pair(p, r)
|
| 72 |
+
if not v["keep"]: continue
|
| 73 |
+
if HAS_DEDUP and not DedupStore.is_new(p, source=f"bulk-{repo}"): continue
|
| 74 |
+
f.write(json.dumps({"prompt": p, "response": r, "source": repo}, ensure_ascii=False) + "\n")
|
| 75 |
+
kept += 1
|
| 76 |
+
|
| 77 |
+
print(f"KEPT={kept}")
|
| 78 |
+
PYEOF
|
| 79 |
+
|
| 80 |
+
# Parse KEPT from python output
|
| 81 |
+
KEPT=$(grep -oE "KEPT=[0-9]+" "$LOG" | tail -1 | cut -d= -f2)
|
| 82 |
+
KEPT=${KEPT:-0}
|
| 83 |
+
|
| 84 |
+
# Mark done in coordinator
|
| 85 |
+
python3 "$HOME/.surrogate/bin/v2/bulk-mirror-coordinator.py" done "$CID" "$KEPT" "$ERROR" >> "$LOG"
|
| 86 |
+
echo "[$(date +%H:%M:%S)] $WORKER_ID done #$CID kept=$KEPT" | tee -a "$LOG"
|
| 87 |
+
|
| 88 |
+
# Brief pause to be gentle on HF API
|
| 89 |
+
sleep 10
|
| 90 |
+
done
|
|
@@ -325,6 +325,19 @@ echo "[$(date +%H:%M:%S)] parquet-direct-ingest started (2 parallel DLs)" >> "$L
|
|
| 325 |
nohup bash ~/.surrogate/bin/skill-synthesis-daemon.sh > "$LOG_DIR/skill-synthesis.log" 2>&1 &
|
| 326 |
echo "[$(date +%H:%M:%S)] skill-synthesis daemon started" >> "$LOG_DIR/boot.log"
|
| 327 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 328 |
# ── 7d. Train-ready pusher — disabled at boot for now. Caused Space
|
| 329 |
# RUNTIME_ERROR on first deployment (2026-04-29). Script kept at
|
| 330 |
# bin/train-ready-pusher.sh; launch manually after Space proves stable:
|
|
|
|
| 325 |
nohup bash ~/.surrogate/bin/skill-synthesis-daemon.sh > "$LOG_DIR/skill-synthesis.log" 2>&1 &
|
| 326 |
echo "[$(date +%H:%M:%S)] skill-synthesis daemon started" >> "$LOG_DIR/boot.log"
|
| 327 |
|
| 328 |
+
# ── 7d. Bulk mirror coordinator + 4 parallel workers ────────────────────────
|
| 329 |
+
# User feedback 2026-04-29: "ทุก agent ทำงานร่วมกัน และไม่ไปที่ซ้ำๆ".
|
| 330 |
+
# Coordinator = SQLite claim queue (~/.surrogate/state/bulk-mirror-claims.db).
|
| 331 |
+
# Workers each pull next pending dataset, mirror+sanitize+dedup, mark done.
|
| 332 |
+
# 100+ massive datasets in bin/v2/bulk-datasets-massive.txt (code/security/SDLC/agent/etc).
|
| 333 |
+
# Lease-based claims (15 min) — crashes auto-expire so other workers pick up.
|
| 334 |
+
python3 ~/.surrogate/bin/v2/bulk-mirror-coordinator.py seed >> "$LOG_DIR/bulk-mirror-seed.log" 2>&1 || true
|
| 335 |
+
for i in 1 2 3 4; do
|
| 336 |
+
nohup bash ~/.surrogate/bin/v2/bulk-mirror-worker.sh "bulk-w$i" \
|
| 337 |
+
> "$LOG_DIR/bulk-worker-$i.log" 2>&1 &
|
| 338 |
+
done
|
| 339 |
+
echo "[$(date +%H:%M:%S)] bulk-mirror coordinator + 4 workers started (100+ datasets queued)" >> "$LOG_DIR/boot.log"
|
| 340 |
+
|
| 341 |
# ── 7d. Train-ready pusher — disabled at boot for now. Caused Space
|
| 342 |
# RUNTIME_ERROR on first deployment (2026-04-29). Script kept at
|
| 343 |
# bin/train-ready-pusher.sh; launch manually after Space proves stable:
|