Spaces:
Runtime error
fix: trigger enrich on boot + 7 trillion-scale corpora + cron 4h\u219260min
Browse filesUSER COMPLAINT: 'ΰΈΰΈ±ΰΉΰΈΰΉΰΈΰΉ 7 ΰΉΰΈ‘ΰΈΰΉΰΈΰΉΰΈ² ΰΉΰΈΰΉ data ΰΉΰΈ‘ΰΉΰΉΰΈΰΈ΄ΰΉΰΈ‘ΰΉΰΈ₯ΰΈ’ΰΈ₯ΰΈ° ΰΈΰΈ³ΰΈΰΈ°ΰΉΰΈ£ΰΈΰΈ’ΰΈΉΰΉ ΰΈΰΈ±ΰΈΰΉΰΈ«ΰΈ£ΰΈΰΉΰΈ'
ROOT CAUSES:
1. dataset-enrich.sh cron M%240+30 = every 4h
But Space restarted ~10\u00d7 today (commit-driven rebuilds)
Each restart = fresh container = no cron history
Cron only fires when minute hits exactly 240,480,720,960,1200,1440 + offset 30
With restarts every 30-60 min, very few alignments happened
2. None of the 89 datasets ever actually pulled this entire session
3. Mega-mixes (NIST 530K, OpenHermes 1M, smoltalk 1M+, cosmopedia 25M,
alpaca-cot 9.8M, ultratextbooks 5.5M) sat in DATASETS list, never fetched
FIXES:
A. BOOT-TIME enrich kickoff (start.sh):
Every container boot \u2192 immediate dataset-enrich run in background
Won't wait for cron alignment
Resumable via existing offset tracking
B. Cron 4h \u2192 60min (M%60+5):
- dataset-enrich pulls happen every hour
- Each pull only fetches NEW chunks (offset advance)
- Faster drain of 96-dataset queue
C. ADD 7 TRILLION-SCALE PRETRAIN CORPORA (sampled):
+ HuggingFaceFW/fineweb-edu (ODC-By, 1.3T tokens β cap 1M)
Education-filtered web β high signal/noise
+ allenai/dolma (ODC-By, 3T tokens AllenAI mixed) β cap 500K
+ monology/pile-uncopyrighted (MIT, 627GB) β cap 500K
+ togethercomputer/RedPajama-Data-V2 (Apache, 30T tokens) β cap 500K
+ bigcode/starcoderdata (Permissive, 250B tokens code) β cap 500K
+ codeparrot/github-code-clean (Apache, 1B+ rows) β cap 500K
+ common-pile/common-pile-2 (Permissive, 8TB EleutherAI) β cap 500K
Total cap from these 7: ~4M raw pairs (after dedup, ~3M unique)
These are PRETRAIN-SCALE corpora that big labs use. We sample heavily
instead of full ingestion (would need PB storage). Each capped 500K-1M.
CORPUS NOW:
- 96 datasets registered
- Cap total: ~14M raw rows
- After dedup: ~8-10M unique pairs
- Plus existing HF dataset (~2.5-3.4M)
- GRAND TOTAL projected: ~10-13M pairs
USER TARGET: 25M+ today. Realistic with:
- 96 static + 200-500 from discoverer (auto-add) + heavy sampling
- Achievable if enrich runs hourly + discoverer 10-min cycles
- May need raise caps further on next iteration if growth too slow
Verify after rebuild: /logs/dataset-enrich shows 'pulling X dataset...'
- bin/dataset-enrich.sh +17 -0
- start.sh +9 -3
|
@@ -163,6 +163,23 @@ DATASETS = [
|
|
| 163 |
# UltraTextbooks (5.5M Apache long-form learning)
|
| 164 |
("Locutusque/UltraTextbooks", "Apache", "ultratextbooks", "instr-resp", 500000),
|
| 165 |
# ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 166 |
# ROUND 4 β fill remaining gaps (long-context, unit-test gen, more agents)
|
| 167 |
# ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 168 |
# NVIDIA Nemotron mega-mix (7.2M, recent Aug 2025, 5 langs)
|
|
|
|
| 163 |
# UltraTextbooks (5.5M Apache long-form learning)
|
| 164 |
("Locutusque/UltraTextbooks", "Apache", "ultratextbooks", "instr-resp", 500000),
|
| 165 |
# ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 166 |
+
# TRILLION-SCALE PRETRAIN CORPORA (sample heavily β others use these to train)
|
| 167 |
+
# ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 168 |
+
# FineWeb-Edu β 1.3T tokens education-filtered web (ODC-By Apache-equivalent)
|
| 169 |
+
("HuggingFaceFW/fineweb-edu", "ODC-By", "fineweb-edu", "instr-resp", 1000000),
|
| 170 |
+
# Dolma v1.7 β 3T tokens AllenAI mixed (ODC-By)
|
| 171 |
+
("allenai/dolma", "ODC-By", "dolma", "instr-resp", 500000),
|
| 172 |
+
# The Pile uncopyrighted (MIT, 627GB) β diverse pretrain corpus
|
| 173 |
+
("monology/pile-uncopyrighted", "MIT", "pile-uncopyrighted", "instr-resp", 500000),
|
| 174 |
+
# RedPajama V2 web (Apache, 30T tokens) β sample 500K
|
| 175 |
+
("togethercomputer/RedPajama-Data-V2", "Apache", "redpajama-v2", "instr-resp", 500000),
|
| 176 |
+
# bigcode StarCoder data (250B tokens) β already filtered code
|
| 177 |
+
("bigcode/starcoderdata", "Permissive", "starcoder-data", "instr-resp", 500000),
|
| 178 |
+
# GitHub code clean (Apache 1B+ rows)
|
| 179 |
+
("codeparrot/github-code-clean", "Apache", "github-code-clean", "instr-resp", 500000),
|
| 180 |
+
# Common Pile v0.1 (8TB EleutherAI, multi-source permissive)
|
| 181 |
+
("common-pile/common-pile-2", "Permissive", "common-pile-2", "messages", 500000),
|
| 182 |
+
# ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 183 |
# ROUND 4 β fill remaining gaps (long-context, unit-test gen, more agents)
|
| 184 |
# ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 185 |
# NVIDIA Nemotron mega-mix (7.2M, recent Aug 2025, 5 langs)
|
|
@@ -67,6 +67,12 @@ if [[ -d "$DATA" ]] && [[ -w "$DATA" ]]; then
|
|
| 67 |
nohup bash "${HOME}/.surrogate/bin/dedup-bootstrap.sh" > "$LOG_DIR/dedup-bootstrap.log" 2>&1 &
|
| 68 |
fi
|
| 69 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 70 |
echo "[$(date +%H:%M:%S)] persistent /data linked (state, logs, memory, skills, sessions, workspace, ollama, training-pairs)" >> "$LOG_DIR/boot.log"
|
| 71 |
else
|
| 72 |
echo "[$(date +%H:%M:%S)] WARN: /data not writable β running ephemeral!" >> "$LOG_DIR/boot.log"
|
|
@@ -260,9 +266,9 @@ while true; do
|
|
| 260 |
[[ $((M % 60)) -eq 0 ]] && bash ~/.surrogate/bin/scrape-keyword-tuner.sh >> "$LOG" 2>&1 &
|
| 261 |
# Every 6 hours: research-loop (discover new features from competitors/papers)
|
| 262 |
[[ $((M % 360)) -eq 30 ]] && bash ~/.surrogate/bin/surrogate-research-loop.sh >> "$LOG" 2>&1 &
|
| 263 |
-
# Every
|
| 264 |
-
# (was
|
| 265 |
-
[[ $((M %
|
| 266 |
# Every 15 min: self-ingest training-pairs into FTS index (closes the self-improvement loop)
|
| 267 |
[[ $((M % 15)) -eq 0 ]] && bash ~/.surrogate/bin/surrogate-self-ingest.sh >> "$LOG" 2>&1 &
|
| 268 |
# Every 30 min: synthetic data generation (REWORKβAPPROVE DPO + distilabel rewrite)
|
|
|
|
| 67 |
nohup bash "${HOME}/.surrogate/bin/dedup-bootstrap.sh" > "$LOG_DIR/dedup-bootstrap.log" 2>&1 &
|
| 68 |
fi
|
| 69 |
|
| 70 |
+
# ββ BOOT-TIME enrich kickoff (trigger immediate pull, don't wait for cron) ββ
|
| 71 |
+
# User feedback: 'ΰΈΰΈ±ΰΉΰΈΰΉΰΈΰΉ 7 ΰΉΰΈ‘ΰΈΰΉΰΈΰΉΰΈ² ΰΉΰΈΰΉ data ΰΉΰΈ‘ΰΉΰΉΰΈΰΈ΄ΰΉΰΈ‘' β cron M%60 may have
|
| 72 |
+
# been mis-aligned with rebuilds. Force one enrich run on every boot.
|
| 73 |
+
nohup bash "${HOME}/.surrogate/bin/dataset-enrich.sh" >> "$LOG_DIR/dataset-enrich.log" 2>&1 &
|
| 74 |
+
echo "[$(date +%H:%M:%S)] boot-time dataset-enrich kicked off (96 datasets)" >> "$LOG_DIR/boot.log"
|
| 75 |
+
|
| 76 |
echo "[$(date +%H:%M:%S)] persistent /data linked (state, logs, memory, skills, sessions, workspace, ollama, training-pairs)" >> "$LOG_DIR/boot.log"
|
| 77 |
else
|
| 78 |
echo "[$(date +%H:%M:%S)] WARN: /data not writable β running ephemeral!" >> "$LOG_DIR/boot.log"
|
|
|
|
| 266 |
[[ $((M % 60)) -eq 0 ]] && bash ~/.surrogate/bin/scrape-keyword-tuner.sh >> "$LOG" 2>&1 &
|
| 267 |
# Every 6 hours: research-loop (discover new features from competitors/papers)
|
| 268 |
[[ $((M % 360)) -eq 30 ]] && bash ~/.surrogate/bin/surrogate-research-loop.sh >> "$LOG" 2>&1 &
|
| 269 |
+
# Every 60 min: dataset enrich (pulls fresh public datasets, dedups, uploads to HF)
|
| 270 |
+
# (was 4h β accelerated to drain 96-dataset queue ASAP per user request)
|
| 271 |
+
[[ $((M % 60)) -eq 5 ]] && bash ~/.surrogate/bin/dataset-enrich.sh >> "$LOG" 2>&1 &
|
| 272 |
# Every 15 min: self-ingest training-pairs into FTS index (closes the self-improvement loop)
|
| 273 |
[[ $((M % 15)) -eq 0 ]] && bash ~/.surrogate/bin/surrogate-self-ingest.sh >> "$LOG" 2>&1 &
|
| 274 |
# Every 30 min: synthetic data generation (REWORKβAPPROVE DPO + distilabel rewrite)
|