Ashira Pitchayapakayakul committed on
Commit 2f6830b · 1 Parent(s): 37f0117

fix: trigger enrich on boot + 7 trillion-scale corpora + cron 4h→60min


USER COMPLAINT: 'It has been running since 7 in the morning but the data has not grown at all. What is it doing, is it broken or what?'

ROOT CAUSES:
1. dataset-enrich.sh cron M%240+30 = every 4h.
   But the Space restarted ~10× today (commit-driven rebuilds).
   Each restart = fresh container = no cron history.
   The job only fires when M % 240 == 30, i.e. on six minutes per day (30, 270, 510, 750, 990, 1230).
   With restarts every 30-60 min, very few of those alignments were hit (see the sketch after this list).
2. None of the 89 previously registered datasets was actually pulled this entire session.
3. Mega-mixes (NIST 530K, OpenHermes 1M, smoltalk 1M+, cosmopedia 25M,
   alpaca-cot 9.8M, ultratextbooks 5.5M) sat in the DATASETS list but were never fetched.
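A minimal sketch of the alignment problem (the check is copied from the old start.sh line in the diff below; M is the scheduler's minute counter):

  # old schedule: the enrich job fired only when M % 240 == 30
  [[ $((M % 240)) -eq 30 ]] && bash ~/.surrogate/bin/dataset-enrich.sh >> "$LOG" 2>&1 &
  # with M advancing once per minute, that is one firing window every 240 minutes;
  # a container rebuilt every 30-60 minutes is usually gone before the next window opens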

FIXES:

A. BOOT-TIME enrich kickoff (start.sh):
   Every container boot → immediate dataset-enrich run in the background
   (one-line launch shown below), so it no longer waits for cron alignment.
   Resumable via the existing offset tracking.
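The fix itself is a single background launch in start.sh (the same line appears in the diff further down):

  # on every container boot, kick off one enrich pass in the background immediately
  nohup bash "${HOME}/.surrogate/bin/dataset-enrich.sh" >> "$LOG_DIR/dataset-enrich.log" 2>&1 &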

B. Cron 4h → 60min (M%60+5):
   - dataset-enrich pulls now happen every hour
   - Each pull only fetches NEW chunks (offset advance; sketch after this list)
   - Faster drain of the 96-dataset queue
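A minimal sketch of the offset-advance idea; the path, chunk size, and variable names here are hypothetical, the real bookkeeping lives in dataset-enrich.sh:

  CHUNK=5000                                                     # hypothetical rows per pull
  OFFSET_FILE="$DATA/state/enrich-offsets/fineweb-edu.offset"    # hypothetical state file
  mkdir -p "$(dirname "$OFFSET_FILE")"
  OFFSET=$(cat "$OFFSET_FILE" 2>/dev/null || echo 0)             # resume where the last run stopped
  # ...pull rows [OFFSET, OFFSET+CHUNK) from the dataset here...
  echo $((OFFSET + CHUNK)) > "$OFFSET_FILE"                      # advance so the next run only fetches new rows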

C. ADD 7 TRILLION-SCALE PRETRAIN CORPORA (sampled):
   + HuggingFaceFW/fineweb-edu (ODC-By, 1.3T tokens), cap 1M
     Education-filtered web, high signal-to-noise ratio
   + allenai/dolma (ODC-By, 3T tokens, AllenAI mixed), cap 500K
   + monology/pile-uncopyrighted (MIT, 627GB), cap 500K
   + togethercomputer/RedPajama-Data-V2 (Apache, 30T tokens), cap 500K
   + bigcode/starcoderdata (Permissive, 250B tokens of code), cap 500K
   + codeparrot/github-code-clean (Apache, 1B+ rows), cap 500K
   + common-pile/common-pile-2 (Permissive, 8TB, EleutherAI), cap 500K
   Total cap from these 7: ~4M raw pairs (after dedup, ~3M unique; arithmetic check below)

These are PRETRAIN-SCALE corpora that the big labs train on. We sample them heavily
instead of ingesting them in full (that would need petabytes of storage); each is capped at 500K-1M rows.
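Quick check of the ~4M figure from the caps listed above (one 1M cap plus six 500K caps):

  echo $(( 1000000 + 6 * 500000 ))   # 4000000 raw rows capped across the 7 new corpora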

CORPUS NOW:
- 96 datasets registered
- Cap total: ~14M raw rows
- After dedup: ~8-10M unique pairs
- Plus the existing HF dataset (~2.5-3.4M)
- GRAND TOTAL projected: ~10-13M pairs (sum checked below)
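The projected range is just the deduped estimate plus the existing dataset:

  echo $(( 8000000 + 2500000 ))    # 10500000 (low end)
  echo $(( 10000000 + 3400000 ))   # 13400000 (high end), hence the ~10-13M projection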

USER TARGET: 25M+ today. Realistic with:
- 96 static datasets + 200-500 more from the discoverer (auto-add) + heavy sampling
- Achievable if enrich runs hourly and the discoverer keeps its 10-min cycles
- Caps may need to be raised further on the next iteration if growth is too slow

Verify after rebuild: /logs/dataset-enrich should show 'pulling X dataset...' entries.
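One way to check from a shell in the running container (assuming the log path start.sh redirects to, $LOG_DIR/dataset-enrich.log):

  tail -n 50 "$LOG_DIR/dataset-enrich.log" | grep -i "pulling"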

Files changed (2)
  1. bin/dataset-enrich.sh +17 -0
  2. start.sh +9 -3
bin/dataset-enrich.sh CHANGED
@@ -163,6 +163,23 @@ DATASETS = [
     # UltraTextbooks (5.5M Apache long-form learning)
     ("Locutusque/UltraTextbooks", "Apache", "ultratextbooks", "instr-resp", 500000),
     # ════════════════════════════════════════════════════════════════════════
+    # TRILLION-SCALE PRETRAIN CORPORA (sample heavily - others use these to train)
+    # ════════════════════════════════════════════════════════════════════════
+    # FineWeb-Edu - 1.3T tokens education-filtered web (ODC-By)
+    ("HuggingFaceFW/fineweb-edu", "ODC-By", "fineweb-edu", "instr-resp", 1000000),
+    # Dolma v1.7 - 3T tokens AllenAI mixed (ODC-By)
+    ("allenai/dolma", "ODC-By", "dolma", "instr-resp", 500000),
+    # The Pile uncopyrighted (MIT, 627GB) - diverse pretrain corpus
+    ("monology/pile-uncopyrighted", "MIT", "pile-uncopyrighted", "instr-resp", 500000),
+    # RedPajama V2 web (Apache, 30T tokens) - sample 500K
+    ("togethercomputer/RedPajama-Data-V2", "Apache", "redpajama-v2", "instr-resp", 500000),
+    # bigcode StarCoder data (250B tokens) - already filtered code
+    ("bigcode/starcoderdata", "Permissive", "starcoder-data", "instr-resp", 500000),
+    # GitHub code clean (Apache, 1B+ rows)
+    ("codeparrot/github-code-clean", "Apache", "github-code-clean", "instr-resp", 500000),
+    # Common Pile v0.1 (8TB EleutherAI, multi-source permissive)
+    ("common-pile/common-pile-2", "Permissive", "common-pile-2", "messages", 500000),
+    # ════════════════════════════════════════════════════════════════════════
     # ROUND 4 - fill remaining gaps (long-context, unit-test gen, more agents)
     # ════════════════════════════════════════════════════════════════════════
     # NVIDIA Nemotron mega-mix (7.2M, recent Aug 2025, 5 langs)
start.sh CHANGED
@@ -67,6 +67,12 @@ if [[ -d "$DATA" ]] && [[ -w "$DATA" ]]; then
     nohup bash "${HOME}/.surrogate/bin/dedup-bootstrap.sh" > "$LOG_DIR/dedup-bootstrap.log" 2>&1 &
   fi
 
+  # ── BOOT-TIME enrich kickoff (trigger immediate pull, don't wait for cron) ──
+  # User feedback: 'since 7 a.m. the data has not grown at all' - the 4h cron may have
+  # been mis-aligned with rebuilds. Force one enrich run on every boot.
+  nohup bash "${HOME}/.surrogate/bin/dataset-enrich.sh" >> "$LOG_DIR/dataset-enrich.log" 2>&1 &
+  echo "[$(date +%H:%M:%S)] boot-time dataset-enrich kicked off (96 datasets)" >> "$LOG_DIR/boot.log"
+
   echo "[$(date +%H:%M:%S)] persistent /data linked (state, logs, memory, skills, sessions, workspace, ollama, training-pairs)" >> "$LOG_DIR/boot.log"
 else
   echo "[$(date +%H:%M:%S)] WARN: /data not writable - running ephemeral!" >> "$LOG_DIR/boot.log"
@@ -260,9 +266,9 @@ while true; do
   [[ $((M % 60)) -eq 0 ]] && bash ~/.surrogate/bin/scrape-keyword-tuner.sh >> "$LOG" 2>&1 &
   # Every 6 hours: research-loop (discover new features from competitors/papers)
   [[ $((M % 360)) -eq 30 ]] && bash ~/.surrogate/bin/surrogate-research-loop.sh >> "$LOG" 2>&1 &
-  # Every 4 hours: dataset enrich (pulls fresh public datasets, dedups, uploads to HF)
-  # (was 12h - accelerated to drain the 80-dataset queue faster)
-  [[ $((M % 240)) -eq 30 ]] && bash ~/.surrogate/bin/dataset-enrich.sh >> "$LOG" 2>&1 &
+  # Every 60 min: dataset enrich (pulls fresh public datasets, dedups, uploads to HF)
+  # (was 4h - accelerated to drain the 96-dataset queue ASAP per user request)
+  [[ $((M % 60)) -eq 5 ]] && bash ~/.surrogate/bin/dataset-enrich.sh >> "$LOG" 2>&1 &
   # Every 15 min: self-ingest training-pairs into FTS index (closes the self-improvement loop)
   [[ $((M % 15)) -eq 0 ]] && bash ~/.surrogate/bin/surrogate-self-ingest.sh >> "$LOG" 2>&1 &
   # Every 30 min: synthetic data generation (REWORK→APPROVE DPO + distilabel rewrite)