Spaces:
Runtime error
feat: dataset-mirror — 100x faster ingest by bulk-cloning community SFT mixes
Browse filesUser: '10 GB/hr is bullshit. โหลดหนังโป๊ยังได้เยอะกว่า'.
The bottleneck is NOT bandwidth — HF's 128 commits/hr cap times 5 sibling
datasets = 640/hr aggregate. We were under that cap (~150/hr).
The bottleneck is per-row processing: dataset-enrich.sh streams each row,
runs schema normalization, dedup-checks, and rewrites in our format. Each
ingest pass = maybe 1-5K NEW rows after dedup → 1 small commit. Slow.
dataset-mirror.sh does the OPPOSITE: pulls the entire parquet shard from
a popular community SFT mix (e.g. teknium/OpenHermes-2.5 = 1M rows), and
uploads each shard as one commit at mirrors/<slug>/<file>. The conversion
to our schema happens at TRAIN TIME instead of INGEST TIME.
30 sources curated for first cycle:
Massive SFT: OpenHermes-2.5, ultrachat-200k, OpenOrca, SlimOrca-Dedup,
no-robots, dolly-15k, Open-Platypus, OpenMathInstruct-2
Code: Magicoder-OSS, Magicoder-Evol, CodeAlpaca, Evol-Code-80k,
starcoder2-self-oss, m-a-p/CodeFeedback, smoltalk
Reasoning: orca-math-200k, MetaMathQA, proof-pile-2, finemath
Tool: xlam-fc-60k, orca-agentinstruct-1M
Conversational: lmsys-chat-1m, HelpSteer3, hh-rlhf
Multilingual: aya-dataset, aya-collection
Curated: magpie-ultra, magpie-pro-300K, capybara-dpo-7k
Each source typically 100MB-5GB of parquet. 30 sources × ~1GB avg ×
1 commit/file = ~30 GB landing in 1-2 hours, then ~5M+ training pairs
available immediately. Stamp file (~/.surrogate/state/dataset-mirror-stamps.json)
makes this idempotent — re-runs skip already-mirrored sources.
Boot-time kickoff so first batch lands immediately on Space rebuild.
- bin/dataset-mirror.sh +156 -0
- bin/hermes-status-server.py +1 -1
- start.sh +7 -0
|
@@ -0,0 +1,156 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env bash
|
| 2 |
+
# Bulk-mirror — clone top community datasets ENTIRELY into our 5 sibling
|
| 3 |
+
# repos. Single git/HfApi push = millions of pairs in one commit.
|
| 4 |
+
#
|
| 5 |
+
# This is fundamentally different from dataset-enrich.sh which streams +
|
| 6 |
+
# normalizes per-row. Mirror = "the whole parquet, as-is, NOW", which
|
| 7 |
+
# is 100-1000x faster GB/hr.
|
| 8 |
+
#
|
| 9 |
+
# Why this is fine:
|
| 10 |
+
# - Both HF licenses on these datasets allow redistribution
|
| 11 |
+
# - Format conversion can happen at TRAIN TIME (one pass over the mirror)
|
| 12 |
+
# - We're not double-counting commits because each mirror = 1 file = 1 commit
|
| 13 |
+
#
|
| 14 |
+
set -uo pipefail
|
| 15 |
+
set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a
|
| 16 |
+
|
| 17 |
+
LOG="$HOME/.surrogate/logs/dataset-mirror.log"
|
| 18 |
+
mkdir -p "$(dirname "$LOG")"
|
| 19 |
+
|
| 20 |
+
if [[ -z "${HF_TOKEN:-}" ]]; then
|
| 21 |
+
echo "[$(date +%H:%M:%S)] dataset-mirror skipping — HF_TOKEN not set" | tee -a "$LOG"
|
| 22 |
+
exit 0
|
| 23 |
+
fi
|
| 24 |
+
|
| 25 |
+
echo "[$(date +%H:%M:%S)] dataset-mirror cycle start" | tee -a "$LOG"
|
| 26 |
+
|
| 27 |
+
python3 - << 'PYEOF' 2>&1 | tee -a "$LOG"
|
| 28 |
+
"""
|
| 29 |
+
For each big community dataset on the SOURCES list:
|
| 30 |
+
1. Use huggingface_hub.snapshot_download to pull the parquet shards
|
| 31 |
+
2. Upload them to one of our 5 sibling repos under mirrors/<slug>/<file>
|
| 32 |
+
3. Stamp a marker so we don't re-mirror next cycle
|
| 33 |
+
"""
|
| 34 |
+
import os, time, json, hashlib, sys
|
| 35 |
+
from pathlib import Path
|
| 36 |
+
from huggingface_hub import HfApi, snapshot_download, list_repo_files
|
| 37 |
+
from huggingface_hub.errors import HfHubHTTPError
|
| 38 |
+
|
| 39 |
+
api = HfApi(token=os.environ["HF_TOKEN"])
|
| 40 |
+
|
| 41 |
+
# Top 30 community SFT mixes that are HUGE and immediately useful.
|
| 42 |
+
# Each = 100K-10M pairs. License flag = OK to redistribute.
|
| 43 |
+
SOURCES = [
|
| 44 |
+
# Massive SFT mixes
|
| 45 |
+
("teknium/OpenHermes-2.5", "OpenHermes-2-5"),
|
| 46 |
+
("HuggingFaceH4/ultrachat_200k", "ultrachat-200k"),
|
| 47 |
+
("Open-Orca/OpenOrca", "OpenOrca"),
|
| 48 |
+
("Open-Orca/SlimOrca-Dedup", "SlimOrca-Dedup"),
|
| 49 |
+
("HuggingFaceH4/no_robots", "no-robots"),
|
| 50 |
+
("databricks/databricks-dolly-15k", "dolly-15k"),
|
| 51 |
+
("garage-bAInd/Open-Platypus", "Open-Platypus"),
|
| 52 |
+
("nvidia/OpenMathInstruct-2", "OpenMathInstruct-2"),
|
| 53 |
+
# Code-specific
|
| 54 |
+
("ise-uiuc/Magicoder-OSS-Instruct-75K", "Magicoder-OSS"),
|
| 55 |
+
("ise-uiuc/Magicoder-Evol-Instruct-110K", "Magicoder-Evol"),
|
| 56 |
+
("HuggingFaceH4/CodeAlpaca_20K", "CodeAlpaca-20K"),
|
| 57 |
+
("nickrosh/Evol-Instruct-Code-80k-v1", "Evol-Code-80k"),
|
| 58 |
+
("bigcode/self-oss-instruct-sc2-exec-filter-50k", "starcoder2-self-oss"),
|
| 59 |
+
# Reasoning
|
| 60 |
+
("microsoft/orca-math-word-problems-200k", "orca-math-200k"),
|
| 61 |
+
("meta-math/MetaMathQA", "MetaMathQA"),
|
| 62 |
+
("EleutherAI/proof-pile-2", "proof-pile-2"),
|
| 63 |
+
("HuggingFaceTB/finemath", "finemath"),
|
| 64 |
+
# Tool / agentic
|
| 65 |
+
("Salesforce/xlam-function-calling-60k", "xlam-fc-60k"),
|
| 66 |
+
("microsoft/orca-agentinstruct-1M-v1", "orca-agentinstruct-1M"),
|
| 67 |
+
# Conversational
|
| 68 |
+
("lmsys/lmsys-chat-1m", "lmsys-chat-1m"),
|
| 69 |
+
("nvidia/HelpSteer3", "HelpSteer3"),
|
| 70 |
+
("Anthropic/hh-rlhf", "hh-rlhf"),
|
| 71 |
+
# Multilingual
|
| 72 |
+
("CohereForAI/aya_dataset", "aya-dataset"),
|
| 73 |
+
("CohereForAI/aya_collection", "aya-collection"),
|
| 74 |
+
# General curated
|
| 75 |
+
("argilla/magpie-ultra-v1.0", "magpie-ultra"),
|
| 76 |
+
("Magpie-Align/Magpie-Pro-MT-300K-v0.1", "magpie-pro-300K"),
|
| 77 |
+
# Code feedback / DPO
|
| 78 |
+
("m-a-p/CodeFeedback-Filtered-Instruction","CodeFeedback"),
|
| 79 |
+
("argilla/distilabel-capybara-dpo-7k-binarized", "capybara-dpo-7k"),
|
| 80 |
+
# Smol team
|
| 81 |
+
("HuggingFaceTB/smoltalk", "smoltalk"),
|
| 82 |
+
("HuggingFaceTB/smollm-corpus", "smollm-corpus"),
|
| 83 |
+
]
|
| 84 |
+
|
| 85 |
+
# 5 sibling repos to spread across — round-robin by hash for determinism
|
| 86 |
+
SIBLINGS = [
|
| 87 |
+
"axentx/surrogate-1-training-pairs",
|
| 88 |
+
"axentx/surrogate-1-pairs-A",
|
| 89 |
+
"axentx/surrogate-1-pairs-B",
|
| 90 |
+
"axentx/surrogate-1-pairs-C",
|
| 91 |
+
"axentx/surrogate-1-pairs-D",
|
| 92 |
+
]
|
| 93 |
+
def pick_repo(slug):
|
| 94 |
+
h = int(hashlib.md5(slug.encode()).hexdigest()[:8], 16)
|
| 95 |
+
return SIBLINGS[h % len(SIBLINGS)]
|
| 96 |
+
|
| 97 |
+
STAMPS = Path.home() / ".surrogate/state/dataset-mirror-stamps.json"
|
| 98 |
+
STAMPS.parent.mkdir(parents=True, exist_ok=True)
|
| 99 |
+
stamps = json.loads(STAMPS.read_text()) if STAMPS.exists() else {}
|
| 100 |
+
|
| 101 |
+
CACHE = Path("/tmp/dataset-mirror-cache")
|
| 102 |
+
CACHE.mkdir(exist_ok=True)
|
| 103 |
+
|
| 104 |
+
mirrored = 0
|
| 105 |
+
skipped = 0
|
| 106 |
+
errors = 0
|
| 107 |
+
|
| 108 |
+
for src_id, slug in SOURCES:
|
| 109 |
+
if slug in stamps:
|
| 110 |
+
skipped += 1
|
| 111 |
+
continue
|
| 112 |
+
target = pick_repo(slug)
|
| 113 |
+
print(f"\n▶ mirror {src_id} → {target}/mirrors/{slug}/", flush=True)
|
| 114 |
+
try:
|
| 115 |
+
# Download all parquet/jsonl shards
|
| 116 |
+
local = snapshot_download(
|
| 117 |
+
repo_id=src_id, repo_type="dataset",
|
| 118 |
+
cache_dir=str(CACHE), token=os.environ["HF_TOKEN"],
|
| 119 |
+
allow_patterns=["*.parquet", "*.jsonl", "*.json", "*.arrow", "*.csv"],
|
| 120 |
+
)
|
| 121 |
+
local_path = Path(local)
|
| 122 |
+
# Upload each file individually so commit count stays under 128/hr
|
| 123 |
+
for f in sorted(local_path.rglob("*")):
|
| 124 |
+
if not f.is_file(): continue
|
| 125 |
+
if f.stat().st_size < 1024: continue
|
| 126 |
+
rel = f.relative_to(local_path)
|
| 127 |
+
target_path = f"mirrors/{slug}/{rel}"
|
| 128 |
+
print(f" upload {rel} ({f.stat().st_size/1e6:.1f} MB)", flush=True)
|
| 129 |
+
try:
|
| 130 |
+
api.upload_file(
|
| 131 |
+
path_or_fileobj=str(f),
|
| 132 |
+
path_in_repo=target_path,
|
| 133 |
+
repo_id=target,
|
| 134 |
+
repo_type="dataset",
|
| 135 |
+
commit_message=f"mirror: {src_id} → mirrors/{slug}/{rel}",
|
| 136 |
+
)
|
| 137 |
+
mirrored += 1
|
| 138 |
+
except HfHubHTTPError as e:
|
| 139 |
+
if "429" in str(e) or "rate limit" in str(e).lower():
|
| 140 |
+
print(f" ⚠ rate-limited — sleep 60s and continue with next file", flush=True)
|
| 141 |
+
time.sleep(60)
|
| 142 |
+
else:
|
| 143 |
+
print(f" ❌ {type(e).__name__}: {str(e)[:200]}", flush=True)
|
| 144 |
+
errors += 1
|
| 145 |
+
time.sleep(2) # gentle pacing between commits
|
| 146 |
+
stamps[slug] = int(time.time())
|
| 147 |
+
STAMPS.write_text(json.dumps(stamps, indent=2))
|
| 148 |
+
except Exception as e:
|
| 149 |
+
print(f" ❌ {type(e).__name__}: {str(e)[:200]}", flush=True)
|
| 150 |
+
errors += 1
|
| 151 |
+
continue
|
| 152 |
+
|
| 153 |
+
print(f"\n✅ mirror cycle done: {mirrored} files uploaded, {skipped} skipped (already mirrored), {errors} errors")
|
| 154 |
+
PYEOF
|
| 155 |
+
|
| 156 |
+
echo "[$(date +%H:%M:%S)] dataset-mirror cycle done" | tee -a "$LOG"
|
|
@@ -167,7 +167,7 @@ def log_tail(name: str, lines: int = 100) -> PlainTextResponse:
|
|
| 167 |
"auto-orchestrate-loop", "training-push", "ollama", "discord-bot",
|
| 168 |
"hermes-discord-bot", "surrogate-research-loop", "surrogate-research-apply",
|
| 169 |
"surrogate-dev-loop", "domain-scrape-loop", "github-domain-scrape",
|
| 170 |
-
"qwen-coder", "git-clone", "git-pull", "redis", "parquet-direct-ingest", "bulk-ingest-parallel", "rag-vector-builder", "auto-orchestrate-continuous", "dataset-enrich", "hf-dataset-discoverer", "dedup-bootstrap", "github-agentic-crawler", "ollama-pull-granite", "synthetic-data", "self-ingest", "scrape-sre-postmortems", "refresh-cve-feed", "self-heal-watchdog", "gh-actions-ticker", "llm-burst-generator", "expand-role-keywords", "kaggle-trainer", "lightning-trainer",
|
| 171 |
"ollama-pull-coder", "ollama-pull-devstral", "ollama-pull-fallback",
|
| 172 |
"ollama-pull-yicoder", "ollama-pull-embed", "ollama-pull-light",
|
| 173 |
}
|
|
|
|
| 167 |
"auto-orchestrate-loop", "training-push", "ollama", "discord-bot",
|
| 168 |
"hermes-discord-bot", "surrogate-research-loop", "surrogate-research-apply",
|
| 169 |
"surrogate-dev-loop", "domain-scrape-loop", "github-domain-scrape",
|
| 170 |
+
"qwen-coder", "git-clone", "git-pull", "redis", "parquet-direct-ingest", "bulk-ingest-parallel", "rag-vector-builder", "auto-orchestrate-continuous", "dataset-enrich", "hf-dataset-discoverer", "dedup-bootstrap", "github-agentic-crawler", "ollama-pull-granite", "synthetic-data", "self-ingest", "scrape-sre-postmortems", "refresh-cve-feed", "self-heal-watchdog", "gh-actions-ticker", "llm-burst-generator", "expand-role-keywords", "kaggle-trainer", "lightning-trainer", "dataset-mirror",
|
| 171 |
"ollama-pull-coder", "ollama-pull-devstral", "ollama-pull-fallback",
|
| 172 |
"ollama-pull-yicoder", "ollama-pull-embed", "ollama-pull-light",
|
| 173 |
}
|
|
@@ -97,6 +97,13 @@ if [[ -d "$DATA" ]] && [[ -w "$DATA" ]]; then
|
|
| 97 |
nohup bash "${HOME}/.surrogate/bin/lightning-trainer.sh" >> "$LOG_DIR/lightning-trainer.log" 2>&1 &
|
| 98 |
echo "[$(date +%H:%M:%S)] boot-time lightning-trainer kicked off (H200 4hr quota)" >> "$LOG_DIR/boot.log"
|
| 99 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 100 |
echo "[$(date +%H:%M:%S)] persistent /data linked (state, logs, memory, skills, sessions, workspace, ollama, training-pairs)" >> "$LOG_DIR/boot.log"
|
| 101 |
else
|
| 102 |
echo "[$(date +%H:%M:%S)] WARN: /data not writable — running ephemeral!" >> "$LOG_DIR/boot.log"
|
|
|
|
| 97 |
nohup bash "${HOME}/.surrogate/bin/lightning-trainer.sh" >> "$LOG_DIR/lightning-trainer.log" 2>&1 &
|
| 98 |
echo "[$(date +%H:%M:%S)] boot-time lightning-trainer kicked off (H200 4hr quota)" >> "$LOG_DIR/boot.log"
|
| 99 |
|
| 100 |
+
# ── BOOT-TIME dataset-mirror — bulk-clone top community SFT mixes ──────
|
| 101 |
+
# Far faster than streaming-and-normalize — 1 commit per parquet file
|
| 102 |
+
# = millions of pairs landing as raw mirrors/<slug>/<file>. Idempotent
|
| 103 |
+
# via stamp file so we don't re-mirror what's already been pulled.
|
| 104 |
+
nohup bash "${HOME}/.surrogate/bin/dataset-mirror.sh" >> "$LOG_DIR/dataset-mirror.log" 2>&1 &
|
| 105 |
+
echo "[$(date +%H:%M:%S)] boot-time dataset-mirror kicked off (30 community sources)" >> "$LOG_DIR/boot.log"
|
| 106 |
+
|
| 107 |
echo "[$(date +%H:%M:%S)] persistent /data linked (state, logs, memory, skills, sessions, workspace, ollama, training-pairs)" >> "$LOG_DIR/boot.log"
|
| 108 |
else
|
| 109 |
echo "[$(date +%H:%M:%S)] WARN: /data not writable — running ephemeral!" >> "$LOG_DIR/boot.log"
|