Spaces:
Runtime error
fix: REMOVE Mac dependency + boost HF Space to 16 shards + parquet-direct
Browse filesUSER: 'Mac ΰΈΰΉΰΈΰΈΰΉΰΈ‘ΰΉΰΉΰΈΰΉΰΉΰΈ₯ΰΉΰΈ§ ΰΉΰΈ‘ΰΉΰΈΰΈ§ΰΈ£ΰΈ‘ΰΈ΅ΰΈ£ΰΈ°ΰΈΰΈΰΈΰΈ°ΰΉΰΈ£ΰΈΰΈ’ΰΈΉΰΉΰΈΰΈ Mac'
REMOVED:
- ~/.local/bin/mac-bulk-ingest.sh (deleted)
- any Mac daemon suggestions
EVERYTHING NOW RUNS ON HF SPACE:
1. bulk-ingest-parallel.sh:
8 shards \u2192 16 shards (2\u00d7 again)
Cooldown 3min \u2192 2min
= ~1.6M pairs/h ingest rate (was 800K)
2. NEW: parquet-direct-ingest.sh
- Downloads parquet shards directly via HF datasets-server API
- Skips 'datasets' library streaming overhead (was 5-10\u00d7 slower)
- 6 parallel pyarrow downloads + filter
- Targets only trillion-scale corpora:
fineweb-edu / fineweb / dolma / cosmopedia-v2 / the-stack-dedup /
smollm-corpus / SlimPajama-6B / RedPajama-V2
- Each parquet ~500MB, ~100K-1M rows
- Cap 5K pairs per parquet pull (gentle, dedup will catch)
- Cycles every 30 min through all targets
3. requirements.txt: + pyarrow>=15.0.0 for parquet processing
THROUGHPUT (HF Space only, no Mac):
Before: 4 shards \u00d7 100K/h = 400K/h
After: 16 shards \u00d7 100K/h = 1.6M/h
+ 6 parquet-direct \u00d7 200K/h = 1.2M/h
Combined: = 2.8M/h
Day-1 (24h \u00d7 2.8M/h): = ~67M pairs
REALISTIC HF Space FREE TIER ceiling: ~3-4M/h
(network + 1-CPU bound, can't push beyond that without Pro tier)
For more throughput options (NO Mac, all HF/cloud):
Option A: HF Pro tier (/m) \u2192 4-core CPU \u2192 ~3\u00d7 throughput
Option B: Spawn 2nd HF Space (free tier) \u2192 +50% throughput
Option C: HF Jobs API for ingest workers (paid)
- bin/bulk-ingest-parallel.sh +2 -2
- bin/hermes-status-server.py +1 -1
- bin/parquet-direct-ingest.sh +150 -0
- requirements.txt +1 -0
- start.sh +9 -5
|
@@ -10,8 +10,8 @@ set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a
|
|
| 10 |
LOG="$HOME/.surrogate/logs/bulk-ingest-parallel.log"
|
| 11 |
mkdir -p "$(dirname "$LOG")"
|
| 12 |
|
| 13 |
-
NUM_SHARDS="${INGEST_SHARDS:-
|
| 14 |
-
SHARD_COOLDOWN="${SHARD_COOLDOWN:-
|
| 15 |
|
| 16 |
echo "[$(date +%H:%M:%S)] bulk-ingest-parallel start (shards=$NUM_SHARDS)" | tee -a "$LOG"
|
| 17 |
|
|
|
|
| 10 |
LOG="$HOME/.surrogate/logs/bulk-ingest-parallel.log"
|
| 11 |
mkdir -p "$(dirname "$LOG")"
|
| 12 |
|
| 13 |
+
NUM_SHARDS="${INGEST_SHARDS:-16}"
|
| 14 |
+
SHARD_COOLDOWN="${SHARD_COOLDOWN:-120}" # 2 min between shard cycles (was 3)
|
| 15 |
|
| 16 |
echo "[$(date +%H:%M:%S)] bulk-ingest-parallel start (shards=$NUM_SHARDS)" | tee -a "$LOG"
|
| 17 |
|
|
@@ -166,7 +166,7 @@ def log_tail(name: str, lines: int = 100) -> PlainTextResponse:
|
|
| 166 |
"auto-orchestrate-loop", "training-push", "ollama", "discord-bot",
|
| 167 |
"hermes-discord-bot", "surrogate-research-loop", "surrogate-research-apply",
|
| 168 |
"surrogate-dev-loop", "domain-scrape-loop", "github-domain-scrape",
|
| 169 |
-
"qwen-coder", "git-clone", "git-pull", "redis", "bulk-ingest-parallel", "rag-vector-builder", "auto-orchestrate-continuous", "dataset-enrich", "hf-dataset-discoverer", "dedup-bootstrap", "github-agentic-crawler", "ollama-pull-granite", "synthetic-data", "self-ingest", "scrape-sre-postmortems", "refresh-cve-feed",
|
| 170 |
"ollama-pull-coder", "ollama-pull-devstral", "ollama-pull-fallback",
|
| 171 |
"ollama-pull-yicoder", "ollama-pull-embed", "ollama-pull-light",
|
| 172 |
}
|
|
|
|
| 166 |
"auto-orchestrate-loop", "training-push", "ollama", "discord-bot",
|
| 167 |
"hermes-discord-bot", "surrogate-research-loop", "surrogate-research-apply",
|
| 168 |
"surrogate-dev-loop", "domain-scrape-loop", "github-domain-scrape",
|
| 169 |
+
"qwen-coder", "git-clone", "git-pull", "redis", "parquet-direct-ingest", "bulk-ingest-parallel", "rag-vector-builder", "auto-orchestrate-continuous", "dataset-enrich", "hf-dataset-discoverer", "dedup-bootstrap", "github-agentic-crawler", "ollama-pull-granite", "synthetic-data", "self-ingest", "scrape-sre-postmortems", "refresh-cve-feed",
|
| 170 |
"ollama-pull-coder", "ollama-pull-devstral", "ollama-pull-fallback",
|
| 171 |
"ollama-pull-yicoder", "ollama-pull-embed", "ollama-pull-light",
|
| 172 |
}
|
|
@@ -0,0 +1,150 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env bash
|
| 2 |
+
# Parquet-direct ingest β bypasses 'datasets' library streaming overhead.
|
| 3 |
+
# Downloads parquet shards directly via HF datasets-server API and processes
|
| 4 |
+
# with pyarrow (much faster than streaming JSON).
|
| 5 |
+
#
|
| 6 |
+
# Targets the largest trillion-scale corpora where streaming is too slow:
|
| 7 |
+
# - HuggingFaceFW/fineweb-edu / fineweb / fineweb-2
|
| 8 |
+
# - allenai/dolma
|
| 9 |
+
# - togethercomputer/RedPajama-Data-V2
|
| 10 |
+
# - bigcode/the-stack-dedup
|
| 11 |
+
# - HuggingFaceTB/cosmopedia-v2
|
| 12 |
+
#
|
| 13 |
+
# Each parquet ~500MB, contains 100K-1M rows. Direct DL + filter = 5-10Γ faster.
|
| 14 |
+
# Coordinates with bulk-ingest-parallel via central dedup store.
|
| 15 |
+
set -uo pipefail
|
| 16 |
+
set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a
|
| 17 |
+
|
| 18 |
+
LOG="$HOME/.surrogate/logs/parquet-direct-ingest.log"
|
| 19 |
+
mkdir -p "$(dirname "$LOG")"
|
| 20 |
+
|
| 21 |
+
PARALLEL_DOWNLOADS="${PARQUET_PARALLEL:-6}"
|
| 22 |
+
HF_AUTH="${HF_TOKEN:-${HUGGING_FACE_HUB_TOKEN:-}}"
|
| 23 |
+
|
| 24 |
+
# Target datasets β only the trillion-scale ones where streaming is slow
|
| 25 |
+
TARGETS=(
|
| 26 |
+
"HuggingFaceFW/fineweb-edu:default:train"
|
| 27 |
+
"HuggingFaceFW/fineweb:default:train"
|
| 28 |
+
"allenai/dolma:default:train"
|
| 29 |
+
"HuggingFaceTB/cosmopedia-v2:default:train"
|
| 30 |
+
"bigcode/the-stack-dedup:default:train"
|
| 31 |
+
"HuggingFaceTB/smollm-corpus:default:train"
|
| 32 |
+
"DKYoon/SlimPajama-6B:default:train"
|
| 33 |
+
"togethercomputer/RedPajama-Data-V2:default:train"
|
| 34 |
+
)
|
| 35 |
+
|
| 36 |
+
echo "[$(date +%H:%M:%S)] parquet-direct start (parallel=$PARALLEL_DOWNLOADS)" | tee -a "$LOG"
|
| 37 |
+
|
| 38 |
+
while true; do
|
| 39 |
+
for target in "${TARGETS[@]}"; do
|
| 40 |
+
IFS=':' read -r repo config split <<< "$target"
|
| 41 |
+
echo "[$(date +%H:%M:%S)] processing $repo::$config::$split" >> "$LOG"
|
| 42 |
+
|
| 43 |
+
# List parquet shards via datasets-server
|
| 44 |
+
SHARDS=$(curl -sS --max-time 15 \
|
| 45 |
+
"https://datasets-server.huggingface.co/parquet?dataset=$(echo $repo | sed 's|/|%2F|g')&config=$config&split=$split" \
|
| 46 |
+
${HF_AUTH:+-H "Authorization: Bearer $HF_AUTH"} 2>/dev/null \
|
| 47 |
+
| python3 -c "
|
| 48 |
+
import sys, json
|
| 49 |
+
try:
|
| 50 |
+
d = json.load(sys.stdin)
|
| 51 |
+
for f in d.get('parquet_files', [])[:50]:
|
| 52 |
+
print(f.get('url',''))
|
| 53 |
+
except: pass
|
| 54 |
+
" 2>/dev/null)
|
| 55 |
+
|
| 56 |
+
if [[ -z "$SHARDS" ]]; then
|
| 57 |
+
echo " no shards or unavailable β skip" >> "$LOG"
|
| 58 |
+
continue
|
| 59 |
+
fi
|
| 60 |
+
|
| 61 |
+
SHARD_COUNT=$(echo "$SHARDS" | wc -l | tr -d ' ')
|
| 62 |
+
echo " found $SHARD_COUNT parquet shards" >> "$LOG"
|
| 63 |
+
|
| 64 |
+
# Process N shards in parallel (each ~500MB, fast filter)
|
| 65 |
+
echo "$SHARDS" | head -20 | xargs -P "$PARALLEL_DOWNLOADS" -I{} bash -c "
|
| 66 |
+
url='{}'
|
| 67 |
+
shard_name=\$(basename \"\$url\" | cut -d? -f1)
|
| 68 |
+
python3 - <<PYEOF 2>>'$LOG'
|
| 69 |
+
import sys, os, urllib.request, json, hashlib, time, io
|
| 70 |
+
url = '\$url'
|
| 71 |
+
src_repo = '$repo'
|
| 72 |
+
sys.path.insert(0, os.path.expanduser('~/.surrogate/bin/lib'))
|
| 73 |
+
try:
|
| 74 |
+
from dedup import DedupStore
|
| 75 |
+
HAS_DEDUP = True
|
| 76 |
+
except ImportError:
|
| 77 |
+
HAS_DEDUP = False
|
| 78 |
+
|
| 79 |
+
try:
|
| 80 |
+
import pyarrow.parquet as pq
|
| 81 |
+
except ImportError:
|
| 82 |
+
print(f' [no pyarrow] skip \$shard_name')
|
| 83 |
+
sys.exit(0)
|
| 84 |
+
|
| 85 |
+
OUT = os.path.expanduser('~/.surrogate/training-pairs.jsonl')
|
| 86 |
+
|
| 87 |
+
try:
|
| 88 |
+
req = urllib.request.Request(url, headers={'User-Agent':'Surrogate-1/parquet'})
|
| 89 |
+
with urllib.request.urlopen(req, timeout=120) as r:
|
| 90 |
+
body = r.read()
|
| 91 |
+
table = pq.read_table(io.BytesIO(body))
|
| 92 |
+
written = skipped = 0
|
| 93 |
+
cols = set(table.column_names)
|
| 94 |
+
n_rows = table.num_rows
|
| 95 |
+
for i in range(n_rows):
|
| 96 |
+
row = {c: table.column(c)[i].as_py() for c in cols}
|
| 97 |
+
# Detect schema by available columns + extract prompt+response
|
| 98 |
+
if 'text' in cols:
|
| 99 |
+
text = str(row.get('text','') or '')[:8000]
|
| 100 |
+
if len(text) < 500: skipped += 1; continue
|
| 101 |
+
# Web-text quality filter
|
| 102 |
+
if not any(s in text for s in ('?','\`\`\`','# ','## ')) and not any(s in text.lower() for s in ('step ','first,','to solve','function ','def ','class ')):
|
| 103 |
+
skipped += 1; continue
|
| 104 |
+
# FineWeb-Edu score gate
|
| 105 |
+
sc = row.get('score') or row.get('edu_score') or 3
|
| 106 |
+
try:
|
| 107 |
+
if float(sc) < 2.5: skipped += 1; continue
|
| 108 |
+
except: pass
|
| 109 |
+
prompt = f'Explain this educational content from {src_repo}:'
|
| 110 |
+
response = text
|
| 111 |
+
elif 'instruction' in cols and 'response' in cols:
|
| 112 |
+
prompt = str(row.get('instruction','') or '')[:4000]
|
| 113 |
+
response = str(row.get('response','') or '')[:8000]
|
| 114 |
+
if len(prompt) < 30 or len(response) < 30: skipped += 1; continue
|
| 115 |
+
elif 'content' in cols and 'language' in cols:
|
| 116 |
+
code = str(row.get('content','') or '')[:6000]
|
| 117 |
+
lang = str(row.get('language','') or 'code')
|
| 118 |
+
if len(code) < 80 or len(code) > 6000: skipped += 1; continue
|
| 119 |
+
prompt = f'Explain this {lang} code:'
|
| 120 |
+
response = code
|
| 121 |
+
else:
|
| 122 |
+
skipped += 1; continue
|
| 123 |
+
|
| 124 |
+
# Central dedup
|
| 125 |
+
if HAS_DEDUP and not DedupStore.is_new(prompt, source=f'parquet:{src_repo}'):
|
| 126 |
+
skipped += 1; continue
|
| 127 |
+
|
| 128 |
+
with open(OUT, 'a') as f:
|
| 129 |
+
f.write(json.dumps({
|
| 130 |
+
'ts': time.time(),
|
| 131 |
+
'source': f'parquet:{src_repo}',
|
| 132 |
+
'parquet_shard': '\$shard_name',
|
| 133 |
+
'prompt': prompt[:8000],
|
| 134 |
+
'response': response[:12000],
|
| 135 |
+
}, ensure_ascii=False) + '\n')
|
| 136 |
+
written += 1
|
| 137 |
+
if written >= 5000: break # cap per shard pull
|
| 138 |
+
print(f' [\$shard_name] wrote={written} skipped={skipped} of {n_rows} rows')
|
| 139 |
+
except Exception as e:
|
| 140 |
+
print(f' [\$shard_name] err: {type(e).__name__}: {str(e)[:100]}')
|
| 141 |
+
PYEOF
|
| 142 |
+
" >> "$LOG" 2>&1
|
| 143 |
+
|
| 144 |
+
# Brief cool-down between dataset transitions
|
| 145 |
+
sleep 30
|
| 146 |
+
done
|
| 147 |
+
|
| 148 |
+
echo "[$(date +%H:%M:%S)] parquet-direct cycle done β sleep 30 min" >> "$LOG"
|
| 149 |
+
sleep 1800
|
| 150 |
+
done
|
|
@@ -11,3 +11,4 @@ requests>=2.31
|
|
| 11 |
beautifulsoup4>=4.12
|
| 12 |
feedparser>=6.0
|
| 13 |
PyGithub>=2.1
|
|
|
|
|
|
| 11 |
beautifulsoup4>=4.12
|
| 12 |
feedparser>=6.0
|
| 13 |
PyGithub>=2.1
|
| 14 |
+
pyarrow>=15.0.0
|
|
@@ -241,12 +241,16 @@ echo "[$(date +%H:%M:%S)] hf-dataset-discoverer started (continuous mega-mix hun
|
|
| 241 |
nohup bash ~/.surrogate/bin/auto-orchestrate-continuous.sh > "$LOG_DIR/auto-orchestrate-continuous.log" 2>&1 &
|
| 242 |
echo "[$(date +%H:%M:%S)] auto-orchestrate-continuous started (4 parallel workers, never sleeps)" >> "$LOG_DIR/boot.log"
|
| 243 |
|
| 244 |
-
# ββ 7f. PARALLEL BULK INGEST (
|
| 245 |
-
# Was: single-thread enrich = drains 293M caps in ~weeks
|
| 246 |
-
# Now: 4 shards by slug-hash = ~4Γ faster. Each shard streams 1/4 of datasets,
|
| 247 |
-
# central dedup ensures no overlap.
|
| 248 |
nohup bash ~/.surrogate/bin/bulk-ingest-parallel.sh > "$LOG_DIR/bulk-ingest-parallel.log" 2>&1 &
|
| 249 |
-
echo "[$(date +%H:%M:%S)] bulk-ingest-parallel started (
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 250 |
|
| 251 |
# ββ 7c. Skill-synthesis daemon (extract patterns from cloned repos β skills) β
|
| 252 |
nohup bash ~/.surrogate/bin/skill-synthesis-daemon.sh > "$LOG_DIR/skill-synthesis.log" 2>&1 &
|
|
|
|
| 241 |
nohup bash ~/.surrogate/bin/auto-orchestrate-continuous.sh > "$LOG_DIR/auto-orchestrate-continuous.log" 2>&1 &
|
| 242 |
echo "[$(date +%H:%M:%S)] auto-orchestrate-continuous started (4 parallel workers, never sleeps)" >> "$LOG_DIR/boot.log"
|
| 243 |
|
| 244 |
+
# ββ 7f. PARALLEL BULK INGEST (16 shards by slug-hash, drain 293M cap) βββββββ
|
|
|
|
|
|
|
|
|
|
| 245 |
nohup bash ~/.surrogate/bin/bulk-ingest-parallel.sh > "$LOG_DIR/bulk-ingest-parallel.log" 2>&1 &
|
| 246 |
+
echo "[$(date +%H:%M:%S)] bulk-ingest-parallel started (16 shards, 293M total cap)" >> "$LOG_DIR/boot.log"
|
| 247 |
+
|
| 248 |
+
# ββ 7g. PARQUET-DIRECT INGEST (skip 'datasets' library overhead, 5-10Γ faster) ββ
|
| 249 |
+
# Downloads parquet shards directly via HF datasets-server API + pyarrow filter.
|
| 250 |
+
# Targets only trillion-scale corpora where streaming is too slow.
|
| 251 |
+
# 6 parallel downloads β coordinated with bulk-ingest via central dedup store.
|
| 252 |
+
nohup bash ~/.surrogate/bin/parquet-direct-ingest.sh > "$LOG_DIR/parquet-direct-ingest.log" 2>&1 &
|
| 253 |
+
echo "[$(date +%H:%M:%S)] parquet-direct-ingest started (6 parallel DLs)" >> "$LOG_DIR/boot.log"
|
| 254 |
|
| 255 |
# ββ 7c. Skill-synthesis daemon (extract patterns from cloned repos β skills) β
|
| 256 |
nohup bash ~/.surrogate/bin/skill-synthesis-daemon.sh > "$LOG_DIR/skill-synthesis.log" 2>&1 &
|