feat: HF dataset agentic discoverer - never-ending mega-mix hunt
USER REQUEST (translated from Thai): "Find mega-mixes and ingest them agentically.
Once found, stamp them, then go find new ones. Never-ending, and lots of them."
ARCHITECTURE (hf-dataset-discoverer.py):
DISCOVERY:
- 70+ topic search queries covering every SDLC + niche domain
- HF Hub API search every 30-min cycle
- Sort by downloads to surface high-quality datasets first
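For reference, a single discovery request looks like the sketch below (endpoint and
parameters are the ones hf-dataset-discoverer.py uses; the query string is illustrative):

    import json, urllib.parse, urllib.request

    q = "text-to-sql"  # one of the 70+ QUERIES
    url = (f"https://huggingface.co/api/datasets?search={urllib.parse.quote(q)}"
           "&limit=50&sort=downloads&direction=-1")
    # top-50 matches per query, highest downloads first
    with urllib.request.urlopen(url, timeout=15) as r:
        for ds in json.load(r):
            print(ds["id"], ds.get("downloads", 0))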
EVALUATION (per new dataset):
1. License normalize → must be in ALLOWED set (Apache/MIT/CC-BY/CC0/CDLA/ODC-BY/BSD/ISC);
hard-reject anything matching DENY_KEYWORDS (NC, Llama-Community, research-only)
2. Fetch sample row via datasets-server.huggingface.co/first-rows
3. Schema detection - maps row keys to one of 30+ existing branches (see the
key-mapping sketch after this list):
messages, conversations, swe-instance, commit, tools-query-answers,
code-defect, code-defect-cwe, chosen-rejected, instruction-input-output,
instr-resp, query-resp, system-user-assistant, system-chat,
repobench-longctx, ...
4. Quality score (0-1):
+ 0.3 license-clean
+ 0.3 downloads >= 10K (or 0.2 >=1K, 0.1 >=100)
+ 0.2 schema detected
+ 0.1 has dataset card description
+ 0.1 sample non-trivial
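Illustrative key-to-branch mapping for step 3 (sample rows hypothetical; the most
specific key set wins, per detect_schema below):

    {"messages": [...]}                             → messages
    {"question": "...", "answer": "..."}            → query-resp
    {"chosen": "...", "rejected": "..."}            → chosen-rejected
    {"old_contents": "...", "new_contents": "..."}  → commit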
VERDICT:
- score ≥ 0.6 AND schema detected → 'integrated'
→ auto-append to ~/.surrogate/state/dynamic-datasets.json
- score ≥ 0.4 → 'queued' (manual review)
- otherwise → 'rejected-quality' or 'rejected-license'
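Worked example (hypothetical dataset; numbers for illustration only): apache-2.0
license (+0.3), 15K downloads (+0.3), 'messages' schema detected (+0.2), no card
description (+0.0), non-trivial sample (+0.1) → score 0.9 with schema present →
verdict 'integrated'.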
STAMP:
- Central SQLite ~/.surrogate/state/hf-dataset-frontier.db
- Same dataset NEVER re-evaluated (idempotent)
- Verdicts queryable: integrated/queued/rejected-license/rejected-quality
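A minimal sketch of querying the frontier DB (table and column names come from the
dataset_seen schema in the new file; output shape illustrative):

    import sqlite3
    from pathlib import Path

    db = Path.home() / ".surrogate/state/hf-dataset-frontier.db"
    with sqlite3.connect(db) as c:
        # cumulative verdict counts across all cycles
        for verdict, n in c.execute(
                "SELECT verdict, COUNT(*) FROM dataset_seen GROUP BY verdict"):
            print(f"{verdict:24s} {n}")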
INGESTION:
- dataset-enrich.sh now reads the DYNAMIC_PATH JSON on top of the static 89-dataset list
- Each enrich cycle (every 4h) pulls all dynamic finds via existing
streaming + dedup pipeline
- Cap auto-chosen by HF size_categories field
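For context, a dynamic-datasets.json entry carries the fields evaluate_one() emits and
dataset-enrich.sh consumes (id/license/slug/schema/cap plus bookkeeping); this sample
entry is hypothetical:

    [
      {
        "id": "example-org/sql-instruct",
        "license": "apache-2.0",
        "slug": "example-org-sql-instruct",
        "schema": "query-resp",
        "cap": 50000,
        "score": 0.9,
        "downloads": 15000,
        "discovered_ts": 1735689600
      }
    ]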
CONTINUOUS:
- 30-min cycles, no upper limit
- Auto-restart on crash via wrapper script
- Boots automatically via start.sh
EXPECTED VOLUME:
- ~50-150 new datasets evaluated per 30-min cycle
- ~10-30 'integrated' verdicts per cycle (after license + quality filter)
- ~50-200K new pairs/day from discoveries (capped + deduped)
- After 7 days: 89 → ~200-300 datasets registered
Surrogate-1 corpus grows indefinitely toward SOTA quality across all SDLC niches.
Also fixes a tiny bug: dataset-enrich.sh had a stale ~/.claude/venv path - already
fixed in an earlier round, no-op here.
- bin/dataset-enrich.sh +13 -0
- bin/hermes-status-server.py +1 -1
- bin/hf-dataset-discoverer.py +326 -0
- bin/hf-dataset-discoverer.sh +17 -0
- start.sh +7 -3
bin/dataset-enrich.sh
@@ -187,6 +187,19 @@ DATASETS = [
 # ByteDance-Seed/Multi-SWE-bench + bigcode/bigcodebench = EVAL HOLDOUT, never train.
 ]
 
+# ── DYNAMIC LIST - agentic discoverer adds new finds here (no manual edit) ──
+# hf-dataset-discoverer.py runs every 30 min, evaluates new HF datasets,
+# auto-appends high-quality permissive picks to ~/.surrogate/state/dynamic-datasets.json
+DYNAMIC_PATH = Path.home() / ".surrogate/state/dynamic-datasets.json"
+if DYNAMIC_PATH.exists():
+    try:
+        dyn = json.loads(DYNAMIC_PATH.read_text() or "[]")
+        for d in dyn:
+            DATASETS.append((d["id"], d["license"], d["slug"], d["schema"], d["cap"]))
+        print(f"  📦 dynamic discoverer: +{len(dyn)} datasets auto-added", flush=True)
+    except Exception as e:
+        print(f"  ⚠ dynamic list parse err: {e}", flush=True)
+
 # 1. Use CENTRAL dedup store (single source of truth across all writers)
 import sys as _sys
 _sys.path.insert(0, str(Path.home() / ".surrogate/bin/lib"))
bin/hermes-status-server.py
@@ -166,7 +166,7 @@ def log_tail(name: str, lines: int = 100) -> PlainTextResponse:
     "auto-orchestrate-loop", "training-push", "ollama", "discord-bot",
     "hermes-discord-bot", "surrogate-research-loop", "surrogate-research-apply",
     "surrogate-dev-loop", "domain-scrape-loop", "github-domain-scrape",
-    "qwen-coder", "git-clone", "git-pull", "redis", "dedup-bootstrap", "github-agentic-crawler", "ollama-pull-granite", "synthetic-data", "self-ingest", "scrape-sre-postmortems", "refresh-cve-feed",
+    "qwen-coder", "git-clone", "git-pull", "redis", "hf-dataset-discoverer", "dedup-bootstrap", "github-agentic-crawler", "ollama-pull-granite", "synthetic-data", "self-ingest", "scrape-sre-postmortems", "refresh-cve-feed",
     "ollama-pull-coder", "ollama-pull-devstral", "ollama-pull-fallback",
     "ollama-pull-yicoder", "ollama-pull-embed", "ollama-pull-light",
 }
bin/hf-dataset-discoverer.py
@@ -0,0 +1,326 @@
+#!/usr/bin/env python3
+"""
+HF Dataset Agentic Discoverer - never-ending mega-mix hunter.
+
+Searches HF Hub across 70+ topic queries every 30 min. For each NEW dataset:
+1. License filter (Apache/MIT/CC-BY/CC0/CDLA/ODC-BY only)
+2. Quality score (downloads, card, schema detection, sample inspection)
+3. Stamp in central DB ~/.surrogate/state/hf-dataset-frontier.db
+4. If score >= 0.6 AND schema matches one of our 30+ branches:
+   auto-add to dynamic-datasets.json
+5. dataset-enrich.sh reads dynamic list on top of static 89 -> grows indefinitely
+
+Stamps: ds_id -> verdict in {integrated, rejected-license, rejected-quality, queued}
+Same dataset never re-evaluated.
+"""
+from __future__ import annotations
+import hashlib, json, os, re, sqlite3, sys, time
+import urllib.parse, urllib.request
+from pathlib import Path
+
+HOME = Path(os.environ.get("HOME", "/home/hermes"))
+DB = HOME / ".surrogate/state/hf-dataset-frontier.db"
+DYNAMIC = HOME / ".surrogate/state/dynamic-datasets.json"
+LOG = HOME / ".surrogate/logs/hf-dataset-discoverer.log"
+HF_TOKEN = os.environ.get("HF_TOKEN") or os.environ.get("HUGGING_FACE_HUB_TOKEN") or ""
+
+ALLOWED = {
+    "mit", "apache-2.0", "apache 2.0", "cc-by-4.0", "cc-by-3.0", "cc0-1.0",
+    "cdla-permissive-2.0", "cdla-permissive-1.0", "bsd", "bsd-2-clause",
+    "bsd-3-clause", "isc", "odc-by", "openrail", "openrail++",
+}
+DENY_KEYWORDS = ("noncommercial", "non-commercial", "nc-", "-nc", "nc4.0",
+                 "llama2", "llama3", "llama-3", "research-only", "personal-use")
+
+# 70+ search queries - broad SDLC + niche coverage
+QUERIES = [
+    # Code generation/instruction
+    "code instruction", "code completion", "code generation", "python instruction",
+    "code review", "code refactoring", "code translation", "code explanation",
+    # Bug-fix / test
+    "bug fix", "test generation", "unit test", "pull request", "diff review",
+    "vulnerability fix", "security patch",
+    # Reasoning / CoT
+    "chain of thought", "math reasoning", "step by step", "reasoning trace",
+    "deepseek r1", "qwq", "o1 reasoning",
+    # Agent / tool
+    "agent trajectory", "tool calling", "function calling", "react agent",
+    "swe-bench", "agentic", "smolagents",
+    # DevSecOps / IR
+    "incident response", "postmortem", "cybersecurity", "vulnerability",
+    "cve", "exploit", "owasp", "threat intelligence", "security audit",
+    "penetration testing", "red team",
+    # SRE / Cloud
+    "system reliability", "sre", "observability", "kubernetes", "terraform",
+    "cloudformation", "aws", "gcp", "azure", "devops",
+    # Data / ML
+    "dbt", "airflow", "spark", "kafka", "etl", "mlops", "model serving",
+    "embedding dataset", "rag dataset",
+    # SQL / DBA
+    "text-to-sql", "sql query", "database query", "schema",
+    # Architecture
+    "software architecture", "design pattern", "domain driven design",
+    "microservices", "event sourcing", "cqrs", "hexagonal",
+    # Frontend
+    "react", "nextjs", "tailwind", "vue", "svelte", "ui component",
+    # Mobile
+    "ios swift", "android kotlin", "react native", "flutter",
+    # Multilingual
+    "multilingual code", "multilingual instruction", "thai instruction",
+    # Domain niches
+    "compiler", "embedded", "rust systems", "go concurrency",
+    "performance optimization", "concurrency",
+    # Doc / API
+    "openapi", "api design", "technical writing", "documentation dataset",
+    # Constitutional / safety
+    "constitutional ai", "safety dataset", "preference dataset", "dpo",
+    # Recent mega-mixes
+    "instruction tuning 2025", "post-training dataset", "sft mixture",
+]
+
+
+def log(msg: str):
+    line = f"[{time.strftime('%H:%M:%S')}] {msg}"
+    print(line, flush=True)
+    LOG.parent.mkdir(parents=True, exist_ok=True)
+    with open(LOG, "a") as f:
+        f.write(line + "\n")
+
+
+def init_db():
+    DB.parent.mkdir(parents=True, exist_ok=True)
+    with sqlite3.connect(DB) as c:
+        c.executescript("""
+            CREATE TABLE IF NOT EXISTS dataset_seen (
+                ds_id TEXT PRIMARY KEY,
+                evaluated_ts INTEGER NOT NULL,
+                license TEXT,
+                downloads INTEGER,
+                quality_score REAL,
+                schema_branch TEXT,
+                cap INTEGER,
+                slug TEXT,
+                verdict TEXT
+            );
+            CREATE INDEX IF NOT EXISTS idx_verdict ON dataset_seen(verdict);
+            CREATE INDEX IF NOT EXISTS idx_score ON dataset_seen(quality_score DESC);
+        """)
+
+
+def hf_get(url: str, timeout: int = 15):
+    headers = {"User-Agent": "Surrogate-1/dataset-discoverer"}
+    if HF_TOKEN:
+        headers["Authorization"] = f"Bearer {HF_TOKEN}"
+    req = urllib.request.Request(url, headers=headers)
+    try:
+        with urllib.request.urlopen(req, timeout=timeout) as r:
+            return json.load(r)
+    except Exception:
+        return None
+
+
+def detect_schema(sample_row: dict | None) -> str | None:
+    """Map first-row keys to existing dataset-enrich.sh schema branch."""
+    if not sample_row:
+        return None
+    keys = {k.lower() for k in sample_row.keys()}
+    # Order matters - most specific first
+    if "messages" in keys: return "messages"
+    if "conversations" in keys: return "conversations"
+    if "patch" in keys and ("problem_statement" in keys or "issue" in keys):
+        return "swe-instance"
+    if "old_contents" in keys and "new_contents" in keys: return "commit"
+    if "tools" in keys and ("query" in keys or "answers" in keys):
+        return "tools-query-answers"
+    if "func" in keys and "target" in keys: return "code-defect"
+    if "cwe" in keys: return "code-defect-cwe"
+    if "chosen" in keys and "rejected" in keys: return "chosen-rejected"
+    if "instruction" in keys and "output" in keys: return "instruction-input-output"
+    if ("instruction" in keys or "input" in keys) and "response" in keys:
+        return "instr-resp"
+    if "problem" in keys and "solution" in keys: return "instr-resp"
+    if "query" in keys and "response" in keys: return "query-resp"
+    if "question" in keys and "answer" in keys: return "query-resp"
+    if "system" in keys and "user" in keys and "assistant" in keys:
+        return "system-user-assistant"
+    if "system" in keys and "chat" in keys: return "system-chat"
+    if "prompt" in keys and ("completion" in keys or "response" in keys):
+        return "instr-resp"
+    if "context" in keys and ("next_line" in keys or "groundtruth" in keys):
+        return "repobench-longctx"
+    return None
+
+
+def get_first_row(ds_id: str) -> dict:
+    url = f"https://datasets-server.huggingface.co/first-rows?dataset={urllib.parse.quote(ds_id)}&config=default&split=train"
+    data = hf_get(url, timeout=10)
+    if not data: return {}
+    rows = data.get("rows", [])
+    if rows:
+        return rows[0].get("row", {})
+    return {}
+
+
+def normalize_license(meta: dict) -> str:
+    lic = (meta.get("cardData") or {}).get("license", "") or meta.get("license", "")
+    if isinstance(lic, list):
+        lic = lic[0] if lic else ""
+    return str(lic).lower().replace("license:", "").strip()
+
+
+def score_dataset(meta: dict, schema: str | None, sample: dict, lic: str) -> float:
+    score = 0.0
+    # License (mandatory + 0.3)
+    if lic in ALLOWED:
+        score += 0.3
+    # Downloads
+    dl = meta.get("downloads", 0) or 0
+    if dl >= 10000: score += 0.3
+    elif dl >= 1000: score += 0.2
+    elif dl >= 100: score += 0.1
+    # Schema detected
+    if schema: score += 0.2
+    # Card description
+    desc = (meta.get("description") or "")
+    if len(desc) > 200: score += 0.1
+    # Sample non-trivial
+    if sample and len(json.dumps(sample)) > 100: score += 0.1
+    return min(1.0, score)
+
+
+def cap_for_size(meta: dict) -> int:
+    sc = (meta.get("cardData") or {}).get("size_categories")
+    if isinstance(sc, list):
+        sc = sc[0] if sc else ""
+    sc = str(sc or "")
+    if "<1K" in sc: return 1000
+    if "1K<n<10K" in sc: return 10000
+    if "10K<n<100K" in sc: return 50000
+    if "100K<n<1M" in sc: return 100000
+    if "1M<n<10M" in sc: return 200000
+    if "10M<n<100M" in sc: return 300000
+    return 100000
+
+
+def append_dynamic(entry: dict):
+    DYNAMIC.parent.mkdir(parents=True, exist_ok=True)
+    existing = []
+    if DYNAMIC.exists():
+        try:
+            existing = json.loads(DYNAMIC.read_text() or "[]")
+        except json.JSONDecodeError:
+            existing = []
+    # Dedup by id
+    if any(e["id"] == entry["id"] for e in existing):
+        return
+    existing.append(entry)
+    DYNAMIC.write_text(json.dumps(existing, indent=2))
+
+
+def evaluate_one(ds_id: str) -> tuple[str, dict | None]:
+    """Returns (verdict, dynamic_entry_or_None)."""
+    meta = hf_get(f"https://huggingface.co/api/datasets/{ds_id}?full=true")
+    if not meta:
+        return "unreachable", None
+
+    lic = normalize_license(meta)
+    # Hard reject
+    if any(d in lic for d in DENY_KEYWORDS):
+        return "rejected-license", None
+    if not lic and not meta.get("cardData"):
+        return "rejected-no-card", None
+    if lic and lic not in ALLOWED:
+        # Maybe still permissive by name
+        if not any(p in lic for p in ("apache", "mit", "cc0", "cdla", "cc-by", "bsd", "isc", "odc")):
+            return "rejected-license", None
+
+    sample = get_first_row(ds_id)
+    schema = detect_schema(sample)
+    score = score_dataset(meta, schema, sample, lic or "?")
+    cap = cap_for_size(meta)
+    slug = re.sub(r'[^a-zA-Z0-9-]', '-', ds_id.replace("/", "-"))[:40]
+
+    if score >= 0.6 and schema:
+        return "integrated", {
+            "id": ds_id, "license": lic or "permissive", "slug": slug,
+            "schema": schema, "cap": cap, "score": round(score, 2),
+            "downloads": meta.get("downloads", 0),
+            "discovered_ts": int(time.time()),
+        }
+    elif score >= 0.4:
+        return "queued-needs-schema" if not schema else "queued-low-quality", None
+    else:
+        return "rejected-quality", None
+
+
+def stamp(ds_id: str, verdict: str, lic: str = "", dl: int = 0,
+          score: float = 0.0, schema: str = "", cap: int = 0, slug: str = ""):
+    with sqlite3.connect(DB) as c:
+        c.execute(
+            "INSERT OR IGNORE INTO dataset_seen VALUES (?,?,?,?,?,?,?,?,?)",
+            (ds_id, int(time.time()), lic, dl, score, schema, cap, slug, verdict)
+        )
+
+
+def is_seen(ds_id: str) -> bool:
+    with sqlite3.connect(DB) as c:
+        return c.execute("SELECT 1 FROM dataset_seen WHERE ds_id=?", (ds_id,)).fetchone() is not None
+
+
+def discover_cycle() -> dict:
+    new_integrated = 0
+    new_queued = 0
+    new_rejected = 0
+    seen_this_cycle = 0
+    for q in QUERIES:
+        url = f"https://huggingface.co/api/datasets?search={urllib.parse.quote(q)}&limit=50&sort=downloads&direction=-1"
+        results = hf_get(url, timeout=15) or []
+        for ds in results:
+            ds_id = ds.get("id", "")
+            if not ds_id or is_seen(ds_id):
+                continue
+            seen_this_cycle += 1
+            verdict, entry = evaluate_one(ds_id)
+            stamp(ds_id, verdict,
+                  lic=entry.get("license", "") if entry else "",
+                  dl=entry.get("downloads", 0) if entry else 0,
+                  score=entry.get("score", 0.0) if entry else 0.0,
+                  schema=entry.get("schema", "") if entry else "",
+                  cap=entry.get("cap", 0) if entry else 0,
+                  slug=entry.get("slug", "") if entry else "")
+            if verdict == "integrated":
+                append_dynamic(entry)
+                new_integrated += 1
+                log(f"  ✅ {ds_id} | {entry['license']} | {entry['schema']} | cap={entry['cap']:,} | score={entry['score']}")
+            elif verdict.startswith("queued"):
+                new_queued += 1
+            else:
+                new_rejected += 1
+        time.sleep(0.5)  # gentle on HF API
+
+    return {"evaluated": seen_this_cycle, "integrated": new_integrated,
+            "queued": new_queued, "rejected": new_rejected}
+
+
+def main():
+    init_db()
+    log(f"start | hf_token={'set' if HF_TOKEN else 'MISSING'} | queries={len(QUERIES)}")
+
+    while True:
+        t0 = time.time()
+        try:
+            stats = discover_cycle()
+        except Exception as e:
+            log(f"  cycle err {type(e).__name__}: {str(e)[:200]}")
+            stats = {}
+        elapsed = int(time.time() - t0)
+        # Cumulative stats from DB
+        with sqlite3.connect(DB) as c:
+            verdicts = dict(c.execute("SELECT verdict, COUNT(*) FROM dataset_seen GROUP BY verdict").fetchall())
+        log(f"=== cycle done in {elapsed}s | this_cycle={stats} | cumulative={verdicts}")
+        # Sleep 30 min between cycles
+        time.sleep(1800)
+
+
+if __name__ == "__main__":
+    main()
bin/hf-dataset-discoverer.sh
@@ -0,0 +1,17 @@
+#!/usr/bin/env bash
+# Wrapper for hf-dataset-discoverer.py - auto-restart on crash.
+set -uo pipefail
+set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a
+LOG="$HOME/.surrogate/logs/hf-dataset-discoverer.log"
+mkdir -p "$(dirname "$LOG")"
+
+if [[ -z "${HF_TOKEN:-}${HUGGING_FACE_HUB_TOKEN:-}" ]]; then
+  echo "[$(date +%H:%M:%S)] WARN: no HF_TOKEN - discoverer will work but rate-limited" | tee -a "$LOG"
+fi
+
+while true; do
+  python3 "$HOME/.surrogate/bin/hf-dataset-discoverer.py" >> "$LOG" 2>&1
+  rc=$?
+  echo "[$(date +%H:%M:%S)] discoverer exited rc=$rc -> restart in 60s" | tee -a "$LOG"
+  sleep 60
+done
start.sh
@@ -224,12 +224,16 @@ nohup bash ~/.surrogate/bin/agentic-crawler.sh 6 > "$LOG_DIR/agentic-crawler.log
 echo "[$(date +%H:%M:%S)] agentic crawler started (parallel=6)" >> "$LOG_DIR/boot.log"
 
 # ── 7b2. GitHub-specific agentic crawler (4 PATs × 5000/h = 20K req/h) ─────
-# Central SQLite frontier - every visited repo/PR/issue stamped, no dedup with
-# any other agent. Specializes in: trending, topic search, repo deep-dive,
-# closed-issue→PR fix pairs, merged-PR review pairs, release notes.
 nohup bash ~/.surrogate/bin/github-agentic-crawler.sh > "$LOG_DIR/github-agentic-crawler.log" 2>&1 &
 echo "[$(date +%H:%M:%S)] github-agentic-crawler started (token pool maximized)" >> "$LOG_DIR/boot.log"
 
+# ── 7b3. HF Dataset Discoverer (continuous mega-mix hunt) ───────────────────
+# Searches HF Hub across 70+ topic queries every 30 min. Filters license + scores
+# quality. Auto-adds high-confidence permissive picks to dynamic-datasets.json.
+# dataset-enrich reads dynamic list on top of static 89 → infinitely growing corpus.
+nohup bash ~/.surrogate/bin/hf-dataset-discoverer.sh > "$LOG_DIR/hf-dataset-discoverer.log" 2>&1 &
+echo "[$(date +%H:%M:%S)] hf-dataset-discoverer started (continuous mega-mix hunt)" >> "$LOG_DIR/boot.log"
+
 # ── 7c. Skill-synthesis daemon (extract patterns from cloned repos → skills) ─
 nohup bash ~/.surrogate/bin/skill-synthesis-daemon.sh > "$LOG_DIR/skill-synthesis.log" 2>&1 &
 echo "[$(date +%H:%M:%S)] skill-synthesis daemon started" >> "$LOG_DIR/boot.log"