Spaces:

axentx
/

surrogate-1

Runtime error

ashirato commited on 11 days ago

Commit

4831adb

1 Parent(s): daa3c84

feat(bulk-mirror): 100+ massive datasets + 4-worker claim-queue coordination

User feedback 2026-04-29: 'ทุก agent ทำงานร่วมกัน และไม่ไปที่ซ้ำๆ หาจาก keyword
แล้วไปที่ใหม่ๆ'.

NEW:
- bin/v2/bulk-datasets-massive.txt: curated list of 100+ HF datasets relevant
to Surrogate-1 v2 (code, agent, security, SRE, cloud, business, multilingual,
reasoning, IaC). Format: repo|category|max_samples|priority. Total target
ingestion: 100M+ samples / 5+ TB raw / ~500B-1T tokens before dedup.

- bin/v2/bulk-mirror-coordinator.py: SQLite claim-queue broker. Workers atomically
claim tasks (UPDATE...RETURNING with priority sort). Stale claims auto-expire
after 15 min so dead workers don't block. Status command shows queue health.

- bin/v2/bulk-mirror-worker.sh: stateless worker loop. Claims task → mirrors via
datasets streaming → sanitizes (lib/sanitize.py) → central-dedups (DedupStore
SQLite) → writes to /data/bulk-mirror/<repo>.jsonl → marks done. Sleeps 5 min
if queue empty.

- start.sh: seeds queue + spawns 4 parallel workers at boot.

Workers coordinate via SQLite — each pulls a UNIQUE dataset. 4× throughput
vs single mirror. Dedup ensures the same content from 2 sources doesn't get
ingested twice into siblings.

Estimated steady-state with 4 workers + free LLM ladder + Ollama fallback:
- Ingestion: 100-200 GB/hr (raw) → 30-60 GB/hr unique deduped
- 100+ datasets cycling through priority 1 → 2 → 3
- ~7-14 days to mirror+dedup all 100M+ samples
- Result: 200B-500B unique tokens for v2 SFT (10× current)

Files changed (4) hide show

bin/v2/bulk-datasets-massive.txt +154 -0
bin/v2/bulk-mirror-coordinator.py +148 -0
bin/v2/bulk-mirror-worker.sh +90 -0
start.sh +13 -0

bin/v2/bulk-datasets-massive.txt ADDED Viewed

	@@ -0,0 +1,154 @@

+# Surrogate-1 v2 — Massive Relevant Bulk Datasets
+# Format: hf_repo_id|category|max_samples|priority(1=high)
+#
+# Anything checked off here is WORTH MIRROR + ENRICH + DEDUP.
+# dataset-mirror.sh + dataset-enrich.sh will pick these up.
+# Sanitizer (1dfdc54) + dedup (central SQLite) prevent duplicates across siblings.
+# Discovery agents claim entries via dedup store so no two agents pull same dataset.
+# ─── CODE / SDLC (huge mass) ───────────────────────────────────────────────────
+bigcode/the-stack-v2|code|1000000|1
+bigcode/the-stack-v2-smol-ids|code|500000|1
+bigcode/the-stack-v2-train-smol-ids|code|500000|1
+bigcode/starcoderdata|code|1000000|1
+bigcode/the-stack|code|500000|1
+bigcode/commits|code|500000|1
+bigcode/the-stack-github-issues|code|200000|1
+bigcode/issues-with-comments|code|200000|1
+bigcode/the-stack-march-sample|code|100000|2
+bigcode/the-stack-smol|code|100000|2
+codeparrot/github-code-clean|code|500000|1
+codeparrot/github-jupyter-code-to-text|code|100000|2
+codeparrot/codeparrot-clean|code|200000|2
+codeparrot/codecomplex|code|10000|2
+codeparrot/github-jupyter|code|100000|2
+codeparrot/apps|code|10000|2
+codeparrot/conala-mined-curated|code|600000|2
+deepmind/code_contests|code|13000|1
+google-research-datasets/mbpp|code|1000|1
+neulab/conala|code|600000|2
+loubnabnl/humaneval_solutions|code|100|1
+livecodebench/code_generation_lite|code|500|1
+livecodebench/code_generation|code|800|1
+princeton-nlp/SWE-bench_Lite|code|500|1
+princeton-nlp/SWE-bench_Verified|code|500|1
+princeton-nlp/SWE-bench|code|2000|1
+# ─── REASONING / MATH-AS-CODE ─────────────────────────────────────────────────
+nvidia/OpenCodeReasoning|reasoning|2000000|1
+nvidia/OpenCodeReasoning-2|reasoning|2500000|1
+nvidia/OpenCodeInstruct|code|5000000|1
+microsoft/rStar-Coder|code|580000|1
+HuggingFaceH4/OpenR1-Math-220k|reasoning|220000|1
+open-thoughts/OpenThoughts-114k|reasoning|114000|1
+GAIR/LIMR|reasoning|100000|2
+nvidia/OpenMathInstruct-2|math|14000000|1
+nvidia/OpenMathReasoning|math|3000000|1
+microsoft/orca-math-word-problems-200k|math|200000|2
+meta-math/MetaMathQA|math|395000|2
+TIGER-Lab/MathInstruct|math|260000|2
+# ─── LARGE INSTRUCTION / SFT MIXES ────────────────────────────────────────────
+allenai/tulu-3-sft-mixture|sft|1000000|1
+microsoft/orca-agentinstruct-1M-v1|agent|1000000|1
+arcee-ai/EvolKit-20k|sft|20000|2
+HuggingFaceH4/ultrachat_200k|sft|200000|2
+teknium/OpenHermes-2.5|sft|1000000|2
+Open-Orca/OpenOrca|sft|1000000|2
+Open-Orca/SlimOrca-Dedup|sft|500000|2
+inclusionAI/Ling-Coder-SFT|code|4480000|1
+OpenCoder-LLM/opc-sft-stage1|code|2000000|1
+OpenCoder-LLM/opc-sft-stage2|code|2500000|1
+HuggingFaceTB/smoltalk|sft|1000000|2
+HuggingFaceTB/smoltalk2|sft|2000000|1
+ise-uiuc/Magicoder-OSS-Instruct-75K|code|75000|1
+ise-uiuc/Magicoder-Evol-Instruct-110K|code|110000|1
+HuggingFaceH4/CodeAlpaca_20K|code|20000|2
+nickrosh/Evol-Instruct-Code-80k-v1|code|80000|1
+bigcode/self-oss-instruct-sc2-exec-filter-50k|code|50000|1
+m-a-p/CodeFeedback-Filtered-Instruction|code|157000|1
+Magpie-Align/Magpie-Pro-MT-300K-v0.1|sft|300000|1
+Magpie-Align/Magpie-Llama-3.3-Pro-1M-v0.1|sft|1000000|1
+Magpie-Align/Magpie-Qwen2.5-Pro-1M-v0.1|sft|1000000|1
+argilla/magpie-ultra-v1.0|sft|50000|1
+Salesforce/xlam-function-calling-60k|tool|60000|1
+NousResearch/hermes-function-calling-v1|tool|7900|1
+Agent-Ark/Toucan-1.5M|tool|1500000|1
+nvidia/When2Call|tool|30000|1
+Nanbeige/ToolMind|tool|10000|2
+nvidia/Nemotron-SWE-v1|tool|5000|1
+SWE-Gym/OpenHands-Sampled-Trajectories|agent|2400|1
+SWE-Gym/SWE-Gym|agent|491|1
+SWE-Gym/SWE-smith|agent|26000|1
+R2E-Gym/R2E-Gym-Lite|agent|4500|1
+# ─── PREFERENCE / DPO ─────────────────────────────────────────────────────────
+Vezora/Code-Preference-Pairs|dpo|55000|1
+argilla/distilabel-capybara-dpo-7k-binarized|dpo|7000|1
+allenai/tulu-3-pref-personas-instruction-following|dpo|20000|2
+HuggingFaceH4/ultrafeedback_binarized|dpo|62000|2
+nvidia/HelpSteer3|sft|40000|2
+Anthropic/hh-rlhf|sft|160000|2
+# ─── DEVOPS / CLOUD / IAC ─────────────────────────────────────────────────────
+bigcode/the-stack-v2|iac|500000|1
+zerolink/zsql-postgres-dpo|dpo|110000|2
+b-mc2/sql-create-context|sql|78000|2
+defog/sqlcoder|sql|50000|2
+gretelai/synthetic_text_to_sql|sql|105000|2
+HuggingFaceH4/no_robots|sft|10000|2
+# ─── SECURITY / SOC / DEVSECOPS ───────────────────────────────────────────────
+trendmicro-ailab/Primus-FineWeb|security|2570000|1
+trendmicro-ailab/Primus-Instruct|security|100000|1
+trendmicro-ailab/Primus-Reasoning|security|10000|1
+allenai/wildjailbreak|safety|150000|1
+ai4privacy/pii-masking-200k|safety|200000|1
+LLaMa-Factory/data|safety|10000|2
+allenai/coconot|safety|50000|2
+# ─── AGENT / TOOL TRACES (multi-step) ─────────────────────────────────────────
+lambda/hermes-agent-reasoning-traces|agent|14000|1
+nebius/SWE-agent-trajectories|agent|5000|1
+ServiceNow-AI/agent-tool-use|agent|5000|2
+ai-blueprint/agent-research|agent|10000|2
+THUDM/CogACT|agent|50000|2
+# ─── DOCS / KNOWLEDGE (cybersec / cloud / SDLC) ──────────────────────────────
+HuggingFaceFW/fineweb-edu|edu|130000000|1
+HuggingFaceFW/fineweb|web|150000000|2
+HuggingFaceFW/finepdfs|pdf|10000000|2
+HuggingFaceTB/cosmopedia-v2|edu|40000000|2
+HuggingFaceTB/cosmopedia|edu|31000000|2
+allenai/dolma|web|50000000|2
+allenai/peS2o|paper|40000000|2
+allenai/c4|web|10000000|2
+mlfoundations/dclm-baseline-1.0|web|10000000|2
+EleutherAI/proof-pile-2|math|30000000|2
+# ─── BUSINESS / GTM / FINANCE / LEGAL ─────────────────────────────────────────
+PatronusAI/financebench|finance|150|1
+AdaptLLM/finance-tasks|finance|50000|2
+zeroshot/twitter-financial-news-sentiment|finance|10000|3
+bitext/Bitext-customer-support-llm-chatbot-training-dataset|cs|26000|1
+nlpaueb/legal_qa|legal|10000|2
+casehold/casehold|legal|50000|2
+PatronusAI/legalbench|legal|10000|2
+# ─── MULTILINGUAL ─────────────────────────────────────────────────────────────
+CohereForAI/aya_dataset|multi|200000|2
+CohereForAI/aya_collection|multi|200000|2
+HuggingFaceH4/Multilingual-Thinking|reasoning|150000|2
+# ─── REASONING DISTILLATIONS (R1/o1-class) ────────────────────────────────────
+GAIR/LIMO|reasoning|820|1
+simplescaling/s1K|reasoning|1000|1
+simplescaling/s1K-1.1|reasoning|1000|1
+ServiceNow-AI/R1-Distill-SFT|reasoning|17000|1
+agentica-org/DeepScaleR-Preview-Dataset|reasoning|40000|1
+agentica-org/DeepCoder-Preview-Dataset|reasoning|24000|1
+KodCode/KodCode-V1|code|447000|1
+# ─── MEGA-MIX FOR LATER ───────────────────────────────────────────────────────
+allenai/tulu-3-sft-personas-instruction-following|sft|30000|2
+NeelNanda/pile-10k|web|10000|3

bin/v2/bulk-mirror-coordinator.py ADDED Viewed

	@@ -0,0 +1,148 @@

+"""Surrogate-1 v2 — Bulk mirror coordinator with claim queue.
+User feedback 2026-04-29:
+  "ทุก agent ทำงานร่วมกัน และไม่ไปที่ซ้ำๆ หาจาก keyword แล้วไปที่ใหม่ๆ"
+This script is the work-claim broker: any number of mirror/discoverer/enricher
+agents can pull tasks from here. Each task = (dataset_repo, expected_size,
+priority). Claims persist in the central SQLite dedup store (already used by
+DedupStore for content dedup). Each claim has lease (15 min). Crashes auto-
+expire so other workers pick up.
+Usage from agents:
+  python3 bulk-mirror-coordinator.py claim          # → prints next task
+  python3 bulk-mirror-coordinator.py done <id>      # mark done
+  python3 bulk-mirror-coordinator.py status         # show queue + claimed
+  python3 bulk-mirror-coordinator.py seed           # one-time seed from massive list
+"""
+import os, sys, sqlite3, time, json
+from pathlib import Path
+DB_PATH = Path.home() / ".surrogate/state/bulk-mirror-claims.db"
+DB_PATH.parent.mkdir(parents=True, exist_ok=True)
+LIST_PATH = Path.home() / ".surrogate/bin/v2/bulk-datasets-massive.txt"
+LEASE_SECS = 15 * 60   # claim expires after 15 min if no `done` call
+def db():
+    c = sqlite3.connect(str(DB_PATH), isolation_level=None, timeout=10)
+    c.execute("PRAGMA journal_mode=WAL")
+    c.execute("""CREATE TABLE IF NOT EXISTS claims (
+        id INTEGER PRIMARY KEY AUTOINCREMENT,
+        repo_id TEXT UNIQUE,
+        category TEXT,
+        max_samples INTEGER,
+        priority INTEGER,
+        status TEXT DEFAULT 'pending',  -- pending / claimed / done / failed
+        worker_id TEXT,
+        claimed_at INTEGER,
+        completed_at INTEGER,
+        kept_count INTEGER DEFAULT 0,
+        error TEXT
+    )""")
+    c.execute("CREATE INDEX IF NOT EXISTS idx_claims_status_pri ON claims(status, priority)")
+    return c
+def seed():
+    """One-time: load massive list into queue."""
+    if not LIST_PATH.exists():
+        print(f"❌ {LIST_PATH} missing")
+        return
+    c = db()
+    n = 0
+    with open(LIST_PATH) as f:
+        for line in f:
+            line = line.strip()
+            if not line or line.startswith("#"):
+                continue
+            try:
+                repo, cat, mx, pri = line.split("|")
+                c.execute("""INSERT OR IGNORE INTO claims
+                             (repo_id, category, max_samples, priority)
+                             VALUES (?, ?, ?, ?)""",
+                          (repo.strip(), cat.strip(), int(mx), int(pri)))
+                if c.total_changes:
+                    n += 1
+            except Exception as e:
+                print(f"  skip {line[:60]}: {e}")
+    c.close()
+    print(f"✅ seeded {n} new entries (existing rows untouched)")
+def claim(worker_id: str | None = None):
+    """Atomically claim next pending task by priority."""
+    worker_id = worker_id or f"w-{os.getpid()}-{int(time.time())}"
+    c = db()
+    now = int(time.time())
+    # Expire stale claims first
+    c.execute("""UPDATE claims SET status='pending', worker_id=NULL
+                 WHERE status='claimed' AND claimed_at < ?""",
+              (now - LEASE_SECS,))
+    # Claim next pending in priority order
+    c.execute("""UPDATE claims
+                 SET status='claimed', worker_id=?, claimed_at=?
+                 WHERE id = (
+                     SELECT id FROM claims
+                     WHERE status='pending'
+                     ORDER BY priority ASC, RANDOM()
+                     LIMIT 1
+                 )
+                 RETURNING id, repo_id, category, max_samples, priority""",
+              (worker_id, now))
+    row = c.fetchone()
+    c.close()
+    if row:
+        cid, repo, cat, mx, pri = row
+        print(json.dumps({"id": cid, "repo_id": repo, "category": cat,
+                          "max_samples": mx, "priority": pri,
+                          "worker_id": worker_id}))
+    else:
+        print(json.dumps({"id": None, "msg": "no pending tasks"}))
+def done(claim_id: int, kept: int = 0, error: str | None = None):
+    c = db()
+    status = "failed" if error else "done"
+    c.execute("""UPDATE claims SET status=?, completed_at=?, kept_count=?, error=?
+                 WHERE id=?""",
+              (status, int(time.time()), kept, error, claim_id))
+    c.close()
+    print(json.dumps({"id": claim_id, "status": status, "kept": kept}))
+def status():
+    c = db()
+    cur = c.execute("""SELECT status, COUNT(*), SUM(kept_count)
+                       FROM claims GROUP BY status""")
+    print(f"{'status':<12} {'count':>6} {'kept_sum':>12}")
+    for s, n, k in cur:
+        print(f"{s:<12} {n:>6} {k or 0:>12}")
+    print()
+    cur = c.execute("""SELECT repo_id, status, kept_count, worker_id
+                       FROM claims
+                       WHERE status='claimed' OR status='failed'
+                       ORDER BY claimed_at DESC LIMIT 20""")
+    print(f"{'repo':<55} {'status':<10} {'kept':>8} {'worker':<20}")
+    for repo, s, k, w in cur:
+        print(f"{repo[:55]:<55} {s:<10} {k or 0:>8} {w or '-':<20}")
+    c.close()
+if __name__ == "__main__":
+    cmd = sys.argv[1] if len(sys.argv) > 1 else "status"
+    if cmd == "seed":
+        seed()
+    elif cmd == "claim":
+        worker_id = sys.argv[2] if len(sys.argv) > 2 else None
+        claim(worker_id)
+    elif cmd == "done":
+        cid = int(sys.argv[2])
+        kept = int(sys.argv[3]) if len(sys.argv) > 3 else 0
+        err = sys.argv[4] if len(sys.argv) > 4 else None
+        done(cid, kept, err)
+    elif cmd == "status":
+        status()
+    else:
+        print(f"unknown: {cmd}", file=sys.stderr)
+        sys.exit(1)

bin/v2/bulk-mirror-worker.sh ADDED Viewed

	@@ -0,0 +1,90 @@

+#!/usr/bin/env bash
+# Bulk mirror worker — claims dataset from coordinator, mirrors + enriches + uploads.
+# Spawn N of these on HF Space; each runs in its own loop, no duplication.
+#
+# Usage: bash bulk-mirror-worker.sh [worker_id]
+set -uo pipefail
+set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a
+WORKER_ID="${1:-w-$$-$(date +%s)}"
+LOG="$HOME/.surrogate/logs/bulk-worker-${WORKER_ID}.log"
+mkdir -p "$(dirname "$LOG")"
+echo "[$(date +%H:%M:%S)] worker $WORKER_ID start" | tee -a "$LOG"
+# Loop forever, claiming + processing
+while true; do
+    # Claim next task
+    TASK=$(python3 "$HOME/.surrogate/bin/v2/bulk-mirror-coordinator.py" claim "$WORKER_ID")
+    REPO=$(echo "$TASK" | python3 -c "import sys, json; print(json.load(sys.stdin).get('repo_id') or '')")
+    CID=$(echo "$TASK" | python3 -c "import sys, json; print(json.load(sys.stdin).get('id') or '')")
+    MAX_N=$(echo "$TASK" | python3 -c "import sys, json; print(json.load(sys.stdin).get('max_samples') or 50000)")
+    if [[ -z "$REPO" || "$REPO" == "None" ]]; then
+        echo "[$(date +%H:%M:%S)] no work; sleep 5 min" | tee -a "$LOG"
+        sleep 300
+        continue
+    fi
+    echo "[$(date +%H:%M:%S)] $WORKER_ID claimed #$CID $REPO (max=$MAX_N)" | tee -a "$LOG"
+    # Run mirror (uses sanitizer + DedupStore + relevance filter via dataset-mirror.sh)
+    KEPT=0
+    ERROR=""
+    HF_TOKEN="$HF_TOKEN" python3 - "$REPO" "$MAX_N" <<'PYEOF' 2>>"$LOG" || ERROR="failed"
+import sys, os, json
+from pathlib import Path
+sys.path.insert(0, str(Path.home() / ".surrogate/bin/lib"))
+from sanitize import filter_pair
+try: from dedup import DedupStore; HAS_DEDUP = True
+except Exception: HAS_DEDUP = False
+repo, max_n = sys.argv[1], int(sys.argv[2])
+from datasets import load_dataset
+try:
+    ds = load_dataset(repo, split="train", streaming=True)
+except Exception as e:
+    print(f"  load fail: {e}")
+    print(f"KEPT=0")
+    sys.exit(0)
+import time as _t
+out_path = Path.home() / f".surrogate/data/bulk-mirror/{repo.replace('/','_')}.jsonl"
+out_path.parent.mkdir(parents=True, exist_ok=True)
+kept = 0
+with open(out_path, "w") as f:
+    for ex in ds:
+        if kept >= max_n: break
+        # Robust extraction
+        p = (ex.get("prompt") or ex.get("instruction") or ex.get("question")
+             or ex.get("input") or ex.get("query") or ex.get("text") or "")
+        r = (ex.get("response") or ex.get("answer") or ex.get("output")
+             or ex.get("completion") or ex.get("chosen") or "")
+        if (not p or not r) and isinstance(ex.get("messages"), list) and len(ex["messages"]) >= 2:
+            msgs = ex["messages"]
+            u = next((m.get("content","") or m.get("value","") for m in msgs if m.get("role") in ("user","human") or m.get("from") in ("user","human")), "")
+            a = next((m.get("content","") or m.get("value","") for m in msgs if m.get("role") in ("assistant","gpt") or m.get("from") in ("assistant","gpt")), "")
+            if u and a: p, r = u, a
+        if not p or not r: continue
+        p, r = str(p)[:6000].strip(), str(r)[:8000].strip()
+        if len(p) < 20 or len(r) < 30: continue
+        v = filter_pair(p, r)
+        if not v["keep"]: continue
+        if HAS_DEDUP and not DedupStore.is_new(p, source=f"bulk-{repo}"): continue
+        f.write(json.dumps({"prompt": p, "response": r, "source": repo}, ensure_ascii=False) + "\n")
+        kept += 1
+print(f"KEPT={kept}")
+PYEOF
+    # Parse KEPT from python output
+    KEPT=$(grep -oE "KEPT=[0-9]+" "$LOG" | tail -1 | cut -d= -f2)
+    KEPT=${KEPT:-0}
+    # Mark done in coordinator
+    python3 "$HOME/.surrogate/bin/v2/bulk-mirror-coordinator.py" done "$CID" "$KEPT" "$ERROR" >> "$LOG"
+    echo "[$(date +%H:%M:%S)] $WORKER_ID done #$CID kept=$KEPT" | tee -a "$LOG"
+    # Brief pause to be gentle on HF API
+    sleep 10
+done

start.sh CHANGED Viewed

@@ -325,6 +325,19 @@ echo "[$(date +%H:%M:%S)] parquet-direct-ingest started (2 parallel DLs)" >> "$L
 nohup bash ~/.surrogate/bin/skill-synthesis-daemon.sh > "$LOG_DIR/skill-synthesis.log" 2>&1 &
 echo "[$(date +%H:%M:%S)] skill-synthesis daemon started" >> "$LOG_DIR/boot.log"
 # ── 7d. Train-ready pusher — disabled at boot for now. Caused Space
 #       RUNTIME_ERROR on first deployment (2026-04-29). Script kept at
 #       bin/train-ready-pusher.sh; launch manually after Space proves stable:

 nohup bash ~/.surrogate/bin/skill-synthesis-daemon.sh > "$LOG_DIR/skill-synthesis.log" 2>&1 &
 echo "[$(date +%H:%M:%S)] skill-synthesis daemon started" >> "$LOG_DIR/boot.log"
+# ── 7d. Bulk mirror coordinator + 4 parallel workers ────────────────────────
+# User feedback 2026-04-29: "ทุก agent ทำงานร่วมกัน และไม่ไปที่ซ้ำๆ".
+# Coordinator = SQLite claim queue (~/.surrogate/state/bulk-mirror-claims.db).
+# Workers each pull next pending dataset, mirror+sanitize+dedup, mark done.
+# 100+ massive datasets in bin/v2/bulk-datasets-massive.txt (code/security/SDLC/agent/etc).
+# Lease-based claims (15 min) — crashes auto-expire so other workers pick up.
+python3 ~/.surrogate/bin/v2/bulk-mirror-coordinator.py seed >> "$LOG_DIR/bulk-mirror-seed.log" 2>&1 || true
+for i in 1 2 3 4; do
+    nohup bash ~/.surrogate/bin/v2/bulk-mirror-worker.sh "bulk-w$i" \
+        > "$LOG_DIR/bulk-worker-$i.log" 2>&1 &
+done
+echo "[$(date +%H:%M:%S)] bulk-mirror coordinator + 4 workers started (100+ datasets queued)" >> "$LOG_DIR/boot.log"
 # ── 7d. Train-ready pusher — disabled at boot for now. Caused Space
 #       RUNTIME_ERROR on first deployment (2026-04-29). Script kept at
 #       bin/train-ready-pusher.sh; launch manually after Space proves stable: