Ashira Pitchayapakayakul committed on
Commit 2f6830b · 1 Parent(s): 37f0117

fix: trigger enrich on boot + 7 trillion-scale corpora + cron 4h→60min


USER COMPLAINT: 'It has been running since 7 in the morning but the data has not grown at all. What is it doing, is it broken or what?'

ROOT CAUSES:
1. dataset-enrich.sh cron M%240+30 = every 4h.
   But the Space restarted ~10× today (commit-driven rebuilds).
   Each restart = fresh container = no cron history.
   The job only fires when M % 240 == 30, i.e. on six minutes per day (30, 270, 510, 750, 990, 1230).
   With restarts every 30-60 min, very few of those alignments were hit (see the sketch after this list).
2. None of the 89 previously registered datasets was actually pulled this entire session.
3. Mega-mixes (NIST 530K, OpenHermes 1M, smoltalk 1M+, cosmopedia 25M,
   alpaca-cot 9.8M, ultratextbooks 5.5M) sat in the DATASETS list but were never fetched.
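A minimal sketch of the alignment problem (the check is copied from the old start.sh line in the diff below; M is the scheduler's minute counter):

  # old schedule: the enrich job fired only when M % 240 == 30
  [[ $((M % 240)) -eq 30 ]] && bash ~/.surrogate/bin/dataset-enrich.sh >> "$LOG" 2>&1 &
  # with M advancing once per minute, that is one firing window every 240 minutes;
  # a container rebuilt every 30-60 minutes is usually gone before the next window opens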

FIXES:

A. BOOT-TIME enrich kickoff (start.sh):
   Every container boot → immediate dataset-enrich run in the background
   (one-line launch shown below), so it no longer waits for cron alignment.
   Resumable via the existing offset tracking.
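The fix itself is a single background launch in start.sh (the same line appears in the diff further down):

  # on every container boot, kick off one enrich pass in the background immediately
  nohup bash "${HOME}/.surrogate/bin/dataset-enrich.sh" >> "$LOG_DIR/dataset-enrich.log" 2>&1 &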

B. Cron 4h → 60min (M%60+5):
   - dataset-enrich pulls now happen every hour
   - Each pull only fetches NEW chunks (offset advance; sketch after this list)
   - Faster drain of the 96-dataset queue
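A minimal sketch of the offset-advance idea; the path, chunk size, and variable names here are hypothetical, the real bookkeeping lives in dataset-enrich.sh:

  CHUNK=5000                                                     # hypothetical rows per pull
  OFFSET_FILE="$DATA/state/enrich-offsets/fineweb-edu.offset"    # hypothetical state file
  mkdir -p "$(dirname "$OFFSET_FILE")"
  OFFSET=$(cat "$OFFSET_FILE" 2>/dev/null || echo 0)             # resume where the last run stopped
  # ...pull rows [OFFSET, OFFSET+CHUNK) from the dataset here...
  echo $((OFFSET + CHUNK)) > "$OFFSET_FILE"                      # advance so the next run only fetches new rows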

C. ADD 7 TRILLION-SCALE PRETRAIN CORPORA (sampled):
   + HuggingFaceFW/fineweb-edu (ODC-By, 1.3T tokens), cap 1M
     Education-filtered web, high signal-to-noise ratio
   + allenai/dolma (ODC-By, 3T tokens, AllenAI mixed), cap 500K
   + monology/pile-uncopyrighted (MIT, 627GB), cap 500K
   + togethercomputer/RedPajama-Data-V2 (Apache, 30T tokens), cap 500K
   + bigcode/starcoderdata (Permissive, 250B tokens of code), cap 500K
   + codeparrot/github-code-clean (Apache, 1B+ rows), cap 500K
   + common-pile/common-pile-2 (Permissive, 8TB, EleutherAI), cap 500K
   Total cap from these 7: ~4M raw pairs (after dedup, ~3M unique; arithmetic check below)

These are PRETRAIN-SCALE corpora that the big labs train on. We sample them heavily
instead of ingesting them in full (that would need petabytes of storage); each is capped at 500K-1M rows.
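Quick check of the ~4M figure from the caps listed above (one 1M cap plus six 500K caps):

  echo $(( 1000000 + 6 * 500000 ))   # 4000000 raw rows capped across the 7 new corpora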

CORPUS NOW:
- 96 datasets registered
- Cap total: ~14M raw rows
- After dedup: ~8-10M unique pairs
- Plus the existing HF dataset (~2.5-3.4M)
- GRAND TOTAL projected: ~10-13M pairs (sum checked below)
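The projected range is just the deduped estimate plus the existing dataset:

  echo $(( 8000000 + 2500000 ))    # 10500000 (low end)
  echo $(( 10000000 + 3400000 ))   # 13400000 (high end), hence the ~10-13M projection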

USER TARGET: 25M+ today. Realistic with:
- 96 static datasets + 200-500 more from the discoverer (auto-add) + heavy sampling
- Achievable if enrich runs hourly and the discoverer keeps its 10-min cycles
- Caps may need to be raised further on the next iteration if growth is too slow

Verify after rebuild: /logs/dataset-enrich should show 'pulling X dataset...' entries.
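One way to check from a shell in the running container (assuming the log path start.sh redirects to, $LOG_DIR/dataset-enrich.log):

  tail -n 50 "$LOG_DIR/dataset-enrich.log" | grep -i "pulling"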

Files changed (2)
  1. bin/dataset-enrich.sh +17 -0
  2. start.sh +9 -3
bin/dataset-enrich.sh CHANGED
@@ -163,6 +163,23 @@ DATASETS = [
     # UltraTextbooks (5.5M Apache long-form learning)
     ("Locutusque/UltraTextbooks", "Apache", "ultratextbooks", "instr-resp", 500000),
     # ════════════════════════════════════════════════════════════════════════
+    # TRILLION-SCALE PRETRAIN CORPORA (sample heavily - others use these to train)
+    # ════════════════════════════════════════════════════════════════════════
+    # FineWeb-Edu - 1.3T tokens education-filtered web (ODC-By)
+    ("HuggingFaceFW/fineweb-edu", "ODC-By", "fineweb-edu", "instr-resp", 1000000),
+    # Dolma v1.7 - 3T tokens AllenAI mixed (ODC-By)
+    ("allenai/dolma", "ODC-By", "dolma", "instr-resp", 500000),
+    # The Pile uncopyrighted (MIT, 627GB) - diverse pretrain corpus
+    ("monology/pile-uncopyrighted", "MIT", "pile-uncopyrighted", "instr-resp", 500000),
+    # RedPajama V2 web (Apache, 30T tokens) - sample 500K
+    ("togethercomputer/RedPajama-Data-V2", "Apache", "redpajama-v2", "instr-resp", 500000),
+    # bigcode StarCoder data (250B tokens) - already filtered code
+    ("bigcode/starcoderdata", "Permissive", "starcoder-data", "instr-resp", 500000),
+    # GitHub code clean (Apache, 1B+ rows)
+    ("codeparrot/github-code-clean", "Apache", "github-code-clean", "instr-resp", 500000),
+    # Common Pile v0.1 (8TB EleutherAI, multi-source permissive)
+    ("common-pile/common-pile-2", "Permissive", "common-pile-2", "messages", 500000),
+    # ════════════════════════════════════════════════════════════════════════
     # ROUND 4 - fill remaining gaps (long-context, unit-test gen, more agents)
     # ════════════════════════════════════════════════════════════════════════
     # NVIDIA Nemotron mega-mix (7.2M, recent Aug 2025, 5 langs)
start.sh CHANGED
@@ -67,6 +67,12 @@ if [[ -d "$DATA" ]] && [[ -w "$DATA" ]]; then
     nohup bash "${HOME}/.surrogate/bin/dedup-bootstrap.sh" > "$LOG_DIR/dedup-bootstrap.log" 2>&1 &
   fi
 
+  # ── BOOT-TIME enrich kickoff (trigger immediate pull, don't wait for cron) ──
+  # User feedback: 'since 7 a.m. the data has not grown at all' - the 4h cron may have
+  # been mis-aligned with rebuilds. Force one enrich run on every boot.
+  nohup bash "${HOME}/.surrogate/bin/dataset-enrich.sh" >> "$LOG_DIR/dataset-enrich.log" 2>&1 &
+  echo "[$(date +%H:%M:%S)] boot-time dataset-enrich kicked off (96 datasets)" >> "$LOG_DIR/boot.log"
+
   echo "[$(date +%H:%M:%S)] persistent /data linked (state, logs, memory, skills, sessions, workspace, ollama, training-pairs)" >> "$LOG_DIR/boot.log"
 else
   echo "[$(date +%H:%M:%S)] WARN: /data not writable - running ephemeral!" >> "$LOG_DIR/boot.log"
@@ -260,9 +266,9 @@ while true; do
   [[ $((M % 60)) -eq 0 ]] && bash ~/.surrogate/bin/scrape-keyword-tuner.sh >> "$LOG" 2>&1 &
   # Every 6 hours: research-loop (discover new features from competitors/papers)
   [[ $((M % 360)) -eq 30 ]] && bash ~/.surrogate/bin/surrogate-research-loop.sh >> "$LOG" 2>&1 &
-  # Every 4 hours: dataset enrich (pulls fresh public datasets, dedups, uploads to HF)
-  # (was 12h - accelerated to drain the 80-dataset queue faster)
-  [[ $((M % 240)) -eq 30 ]] && bash ~/.surrogate/bin/dataset-enrich.sh >> "$LOG" 2>&1 &
+  # Every 60 min: dataset enrich (pulls fresh public datasets, dedups, uploads to HF)
+  # (was 4h - accelerated to drain the 96-dataset queue ASAP per user request)
+  [[ $((M % 60)) -eq 5 ]] && bash ~/.surrogate/bin/dataset-enrich.sh >> "$LOG" 2>&1 &
   # Every 15 min: self-ingest training-pairs into FTS index (closes the self-improvement loop)
   [[ $((M % 15)) -eq 0 ]] && bash ~/.surrogate/bin/surrogate-self-ingest.sh >> "$LOG" 2>&1 &
   # Every 30 min: synthetic data generation (REWORK→APPROVE DPO + distilabel rewrite)