Commit cfacedb (verified) by Ezekiel999 · 1 Parent(s): 88202f1

Add corpus-build monitoring runbook (SSH, tmux, GCS commands)

  1. MONITORING.md +192 -0
MONITORING.md ADDED
@@ -0,0 +1,192 @@
# AksaraLLM 20B: Monitoring Runbook

The corpus build is running in a **tmux session on `aksara-20b-v6e-8`**. It keeps running after this Devin session ends. Use this runbook to check progress, diagnose issues, or stop and restart the build.

## Quick status (from your laptop)

```bash
# 1. Check bucket growth (how much corpus is in GCS)
gcloud storage du -s gs://aksarallm20b-eu/pretrain/v1/

# Per-source breakdown
gcloud storage du -s \
  gs://aksarallm20b-eu/pretrain/v1/fineweb \
  gs://aksarallm20b-eu/pretrain/v1/fineweb2_id \
  gs://aksarallm20b-eu/pretrain/v1/culturax_id \
  gs://aksarallm20b-eu/pretrain/v1/culturax_ms \
  gs://aksarallm20b-eu/pretrain/v1/fineweb2_jv \
  gs://aksarallm20b-eu/pretrain/v1/fineweb2_su \
  gs://aksarallm20b-eu/pretrain/v1/culturax_jv \
  gs://aksarallm20b-eu/pretrain/v1/culturax_su \
  gs://aksarallm20b-eu/pretrain/v1/code_search_net \
  gs://aksarallm20b-eu/pretrain/v1/wikipedia_id \
  gs://aksarallm20b-eu/pretrain/v1/wikipedia_jv \
  gs://aksarallm20b-eu/pretrain/v1/wikipedia_en

# 2. Read the latest manifest
gcloud storage cp gs://aksarallm20b-eu/pretrain/v1/manifest.json - | jq '.per_source_stats'
```

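The `du` output above is easier to scan in GiB. The helper below is a minimal sketch, not part of the repository: `du_to_gib` is a hypothetical name, and it assumes each `gcloud storage du -s` line has the form `<bytes> <url>`.

```bash
#!/usr/bin/env bash
# du_to_gib: read "<bytes> <url>" lines on stdin (assumed `gcloud storage du -s`
# output format) and print the size in GiB next to each URL.
du_to_gib() {
  awk '{ printf "%.2f GiB  %s\n", $1 / (1024 * 1024 * 1024), $2 }'
}

# Usage (needs gcloud credentials, so it is left commented out here):
#   gcloud storage du -s gs://aksarallm20b-eu/pretrain/v1/ | du_to_gib
```

Running the pipeline twice a few minutes apart gives a rough growth rate without opening an SSH session.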
## Detailed status (SSH into TPU)

```bash
gcloud compute tpus tpu-vm ssh aksara-20b-v6e-8 --zone=europe-west4-a --worker=0
```

Once inside:

```bash
# See tmux sessions
tmux ls
# → corpus: 2 windows (created ...)

# Attach to live view (Ctrl-b n = next window, Ctrl-b d = detach)
tmux attach -t corpus

# Or just tail logs without attaching
tail -f ~/corpus_build.log    # producer
tail -f ~/corpus_upload.log   # uploader

# Local tmpfs state
du -sh /dev/shm/corpus_work/
find /dev/shm/corpus_work -type f | head
```

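Tailing a log shows activity but not whether it has stalled. As a rough sketch, the helpers below flag logs that have not been written to recently. `log_age_seconds`, `check_logs`, and the 900-second default threshold are assumptions made for illustration; `stat -c %Y` is the GNU form, as found on Ubuntu.

```bash
#!/usr/bin/env bash
# log_age_seconds FILE: print seconds since FILE was last modified (GNU stat).
log_age_seconds() {
  local file=$1 now mtime
  now=$(date +%s)
  mtime=$(stat -c %Y "$file" 2>/dev/null) || return 1
  echo $(( now - mtime ))
}

# check_logs FILE...: report each log as OK, STALE, or MISSING.
# STALE_AFTER (seconds) is a hypothetical knob; 900 is an arbitrary default.
check_logs() {
  local threshold=${STALE_AFTER:-900} f age
  for f in "$@"; do
    age=$(log_age_seconds "$f") || { echo "MISSING $f"; continue; }
    if [ "$age" -gt "$threshold" ]; then
      echo "STALE $f (${age}s)"
    else
      echo "OK $f (${age}s)"
    fi
  done
}

# On the TPU VM:
#   check_logs ~/corpus_build.log ~/corpus_upload.log
```

A stale producer log with a live tmux window usually means the process is blocked rather than crashed, which is worth confirming by attaching.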
## Expected progress

| What | Now |
|---|---|
| Sources being processed | fineweb, fineweb2_id, culturax_id, culturax_ms, fineweb2_jv, fineweb2_su, culturax_jv, culturax_su, code_search_net, wikipedia_id, wikipedia_jv, wikipedia_en |
| Target per full run | 100B tokens (budget distributed by mix %) |
| Throughput | ~100k tokens/sec (CPU-bound on one VM) |
| First shard → GCS | already landed (see bucket) |
| Each new shard | ~6 min per 500 MB (~250M tokens) |
| Wall-clock for 100B | ~10–12 days single-threaded |

**To go faster:** start more parallel producers on different source subsets; see "Scaling out" below.

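The wall-clock figure follows from the throughput: total tokens divided by tokens per second, converted to days. The sketch below makes that arithmetic reusable. `eta_days` is a hypothetical helper, and the optional producer count assumes throughput scales linearly with the number of producers, which a CPU-bound VM may not deliver.

```bash
#!/usr/bin/env bash
# eta_days TOKENS TOKENS_PER_SEC [PRODUCERS]
# Print the estimated wall-clock days, assuming linear scaling across producers.
eta_days() {
  awk -v t="$1" -v r="$2" -v p="${3:-1}" \
    'BEGIN { printf "%.1f\n", t / (r * p) / 86400 }'
}

eta_days 100000000000 100000      # 100B tokens, one producer    → 11.6
eta_days 400000000000 100000 3    # 400B tokens, three producers → 15.4
```

By the same arithmetic, 250M tokens at 100k tokens/sec takes about 42 minutes, so the per-shard interval in the table implies a higher rate than the throughput row. Measuring bucket growth over a known interval gives the most reliable number to plug in.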
## Key files on the TPU VM

| File | Purpose |
|---|---|
| `~/corpus_build_runner.sh` | Launches producer, auto-restarts on crash up to 20 times |
| `~/corpus_upload_loop.sh` | Wrapper; invokes `corpus_upload_loop.py` |
| `~/corpus_upload_loop.py` | Python+gcsfs mirror of tmpfs → GCS every 10 min, deletes local shards older than 20 min |
| `~/.aksara_env` | Holds `HF_TOKEN`; sourced by runner |
| `~/AksaraLLM/scripts/build_pretrain_corpus_v2.py` | The actual producer |
| `/dev/shm/corpus_work/` | Producer's output (RAM-backed 709 GB tmpfs) |
| `/home/ubuntu/corpus_build.log` | Producer log |
| `/home/ubuntu/corpus_upload.log` | Uploader log |

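The runner's "auto-restart up to 20 times" behaviour can be pictured as a bounded retry loop. The sketch below is an illustration of that pattern only; it is not the contents of `corpus_build_runner.sh`, and `run_with_restarts` and its delay argument are hypothetical.

```bash
#!/usr/bin/env bash
# run_with_restarts MAX DELAY CMD...
# Run CMD; if it exits non-zero, restart it after DELAY seconds,
# giving up once MAX restarts have been used.
run_with_restarts() {
  local max=$1 delay=$2 restarts=0
  shift 2
  until "$@"; do
    restarts=$(( restarts + 1 ))
    if [ "$restarts" -gt "$max" ]; then
      echo "giving up after $max restarts" >&2
      return 1
    fi
    echo "command failed; restart $restarts/$max in ${delay}s" >&2
    sleep "$delay"
  done
}

# Hypothetical use, mirroring the limit in the table above:
#   run_with_restarts 20 60 python3 -u scripts/build_pretrain_corpus_v2.py build ...
```

A bounded loop like this stops a permanently broken producer from spinning forever, at the cost of needing a manual restart once the limit is reached.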
## Stop the build

```bash
# On TPU VM:
tmux kill-session -t corpus
# Any shards already uploaded stay in GCS. Local tmpfs shards not yet uploaded are lost on VM reboot.
```

## Restart the build (fresh)

```bash
# On TPU VM:
tmux kill-session -t corpus 2>/dev/null
rm -rf /dev/shm/corpus_work/*
tmux new-session -d -s corpus -n producer 'bash ~/corpus_build_runner.sh'
tmux new-window -t corpus -n uploader 'bash ~/corpus_upload_loop.sh'
```

Note: the producer deduplicates against its in-memory state only. A restart forgets previously seen near-duplicates, so each restart admits a small number of extra duplicates. Exact-dedup (`sha256`) state is reset too, so cross-run dedup should be redone at consolidation time, before the real pretraining run.

## Scaling out (speed up 10×)

The current producer is single-threaded per source and single-process per VM. To produce 400B tokens in 2–3 weeks you need ~8–10 parallel producers. The easiest scheme is to **partition sources across tmux windows**.

```bash
# Example: 3 parallel producers, each owning different sources
tmux kill-session -t corpus 2>/dev/null

# Producer 1: English (highest volume source)
tmux new-session -d -s corpus -n p1 \
  'source ~/.aksara_env && cd ~/AksaraLLM && \
  python3 -u scripts/build_pretrain_corpus_v2.py build \
    --config configs/aksara_20b_dense.json \
    --output-dir /dev/shm/corpus_work \
    --assets-dir /dev/shm/pretrain_assets \
    --target-total-tokens 400000000000 \
    --shard-target-bytes 524288000 \
    --no-decontam \
    --sources fineweb,wikipedia_en 2>&1 | tee -a ~/corpus_p1.log'

# Producer 2: Indonesian bulk
tmux new-window -t corpus -n p2 \
  'source ~/.aksara_env && cd ~/AksaraLLM && \
  python3 -u scripts/build_pretrain_corpus_v2.py build \
    --config configs/aksara_20b_dense.json \
    --output-dir /dev/shm/corpus_work \
    --assets-dir /dev/shm/pretrain_assets \
    --target-total-tokens 400000000000 \
    --shard-target-bytes 524288000 \
    --no-decontam \
    --sources fineweb2_id,culturax_id,wikipedia_id 2>&1 | tee -a ~/corpus_p2.log'

# Producer 3: Malay + JV + SU
tmux new-window -t corpus -n p3 \
  'source ~/.aksara_env && cd ~/AksaraLLM && \
  python3 -u scripts/build_pretrain_corpus_v2.py build \
    --config configs/aksara_20b_dense.json \
    --output-dir /dev/shm/corpus_work \
    --assets-dir /dev/shm/pretrain_assets \
    --target-total-tokens 400000000000 \
    --shard-target-bytes 524288000 \
    --no-decontam \
    --sources culturax_ms,fineweb2_jv,fineweb2_su,culturax_jv,culturax_su,wikipedia_jv 2>&1 | tee -a ~/corpus_p3.log'

# Keep the uploader running
tmux new-window -t corpus -n uploader 'bash ~/corpus_upload_loop.sh'
```

⚠️ Each producer writes to `/dev/shm/corpus_work/{source}/shard-*.parquet`. As long as no two producers share a source, their outputs do not collide. The uploader mirrors everything to GCS.

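Because a correct partition is what keeps producers from colliding, it can help to check the `--sources` lists before launching. The two helpers below are a small sketch; `duplicate_sources` and `missing_sources` are hypothetical names, not flags or scripts from the repository.

```bash
#!/usr/bin/env bash
# duplicate_sources LIST...
# Print every source that appears more than once across the comma-separated lists.
duplicate_sources() {
  printf '%s\n' "$@" | tr ',' '\n' | sort | uniq -d
}

# missing_sources ALL LIST...
# Print every source in the comma-separated ALL list that no LIST covers.
missing_sources() {
  local all=$1 assigned
  shift
  assigned=$(printf '%s\n' "$@" | tr ',' '\n')
  # -x: whole-line match, -F: fixed strings, -v: keep only unmatched lines.
  printf '%s\n' "$all" | tr ',' '\n' | grep -vxF -e "$assigned" || true
}

ALL=fineweb,fineweb2_id,culturax_id,culturax_ms,fineweb2_jv,fineweb2_su,culturax_jv,culturax_su,code_search_net,wikipedia_id,wikipedia_jv,wikipedia_en
P1=fineweb,wikipedia_en
P2=fineweb2_id,culturax_id,wikipedia_id
P3=culturax_ms,fineweb2_jv,fineweb2_su,culturax_jv,culturax_su,wikipedia_jv

duplicate_sources "$P1" "$P2" "$P3"        # prints nothing: no overlap
missing_sources "$ALL" "$P1" "$P2" "$P3"   # prints: code_search_net
```

With the three lists from the example above, the partition is disjoint, but `code_search_net` is not assigned to any producer, so it would need a fourth window or a place in one of the existing lists.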
## Troubleshooting

### "bucket size: 0.00 GB" stays zero
The uploader is running but has nothing to sync because no shard has flushed yet. Shards flush every ~500 MB of text (~250M tokens). Wait 6–10 min after the producer starts.

### Producer keeps restarting
Check `tail -50 ~/corpus_build.log` for the actual error. Common causes:
- Rate limiting from HF Hub → wait a few minutes; it retries automatically
- Dataset schema change → pin the `datasets` version in requirements
- OOM → `/dev/shm` is filling up; the uploader should purge old shards every 10 min, but if the uploader crashed, local files accumulate

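For the OOM case, a quick way to tell whether the uploader has stopped purging is to count shards older than its 20-minute cutoff. The helper below is a sketch under the assumption that shards follow the `shard-*.parquet` naming used by the producer; `stale_shards` is a hypothetical name.

```bash
#!/usr/bin/env bash
# stale_shards DIR [MINUTES]
# List shard files under DIR last modified more than MINUTES ago (default 20).
stale_shards() {
  find "$1" -type f -name 'shard-*.parquet' -mmin "+${2:-20}"
}

# On the TPU VM:
#   df -h /dev/shm                              # how full the tmpfs is
#   stale_shards /dev/shm/corpus_work | wc -l   # shards the uploader should have purged
```

A growing count here, together with a stale `~/corpus_upload.log`, points at the uploader rather than the producer.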
### `gcloud storage du` shows it working, but manifest.json missing
The manifest is written after each source completes. Early in the run, only the first shard exists and there is no manifest yet. The manifest appears once fineweb (or any other source) finishes.

### tmux session disappears after VM reboot
This is expected: tmux state doesn't survive reboots, and preemptible TPU nodes can reboot. To make the producer survive reboots, configure a systemd service (see "Auto-restart on reboot" below).

## Auto-restart on reboot (optional)

Create `/etc/systemd/system/aksara-corpus.service` on the TPU VM:

```ini
[Unit]
Description=AksaraLLM corpus build producer
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=ubuntu
ExecStart=/bin/bash /home/ubuntu/corpus_build_runner.sh
Restart=always
RestartSec=60
StandardOutput=append:/home/ubuntu/corpus_build.log
StandardError=append:/home/ubuntu/corpus_build.log

[Install]
WantedBy=multi-user.target
```

Then run: `sudo systemctl daemon-reload && sudo systemctl enable --now aksara-corpus`.

Use the same pattern for `aksara-corpus-uploader.service`.
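
For the uploader, the unit could look like the one generated below. This is a sketch derived from the producer unit and the paths in "Key files on the TPU VM"; `write_uploader_unit` is a hypothetical helper, the description text is made up, and the `Wants=` line is an addition (`After=` only orders units, it does not pull the target in).

```bash
#!/usr/bin/env bash
# write_uploader_unit PATH
# Write a systemd unit for the uploader to PATH, mirroring the producer unit.
write_uploader_unit() {
  cat > "$1" <<'EOF'
[Unit]
Description=AksaraLLM corpus build uploader
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=ubuntu
ExecStart=/bin/bash /home/ubuntu/corpus_upload_loop.sh
Restart=always
RestartSec=60
StandardOutput=append:/home/ubuntu/corpus_upload.log
StandardError=append:/home/ubuntu/corpus_upload.log

[Install]
WantedBy=multi-user.target
EOF
}

# On the TPU VM (left commented out because it needs root):
#   write_uploader_unit /tmp/aksara-corpus-uploader.service
#   sudo install -m 644 /tmp/aksara-corpus-uploader.service /etc/systemd/system/
#   sudo systemctl daemon-reload && sudo systemctl enable --now aksara-corpus-uploader
```

Writing to a temporary path first keeps the generated file easy to review before it is installed.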