# AksaraLLM 20B Monitoring Runbook
The corpus build is running in a **tmux session on `aksara-20b-v6e-8`** and will keep running after this Devin session ends. Use this runbook to check progress, diagnose issues, or stop and restart the build.
## Quick status (from your laptop)
```bash
# 1. Check bucket growth (how much corpus is in GCS)
gcloud storage du -s gs://aksarallm20b-eu/pretrain/v1/
# Per-source breakdown
gcloud storage du -s \
  gs://aksarallm20b-eu/pretrain/v1/fineweb \
  gs://aksarallm20b-eu/pretrain/v1/fineweb2_id \
  gs://aksarallm20b-eu/pretrain/v1/culturax_id \
  gs://aksarallm20b-eu/pretrain/v1/culturax_ms \
  gs://aksarallm20b-eu/pretrain/v1/fineweb2_jv \
  gs://aksarallm20b-eu/pretrain/v1/fineweb2_su \
  gs://aksarallm20b-eu/pretrain/v1/culturax_jv \
  gs://aksarallm20b-eu/pretrain/v1/culturax_su \
  gs://aksarallm20b-eu/pretrain/v1/code_search_net \
  gs://aksarallm20b-eu/pretrain/v1/wikipedia_id \
  gs://aksarallm20b-eu/pretrain/v1/wikipedia_jv \
  gs://aksarallm20b-eu/pretrain/v1/wikipedia_en
# 2. Read the latest manifest
gcloud storage cp gs://aksarallm20b-eu/pretrain/v1/manifest.json - | jq '.per_source_stats'
```
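If you just want a single total token count rather than the per-source breakdown, a `jq` one-liner like the one below works. It assumes each entry under `.per_source_stats` carries a numeric `tokens` field; check the manifest first and adjust the key name if the schema differs.

```bash
# Sum token counts across all sources in the manifest.
# Assumes each per_source_stats entry has a numeric "tokens" field;
# adjust the key if the actual manifest uses a different name.
gcloud storage cp gs://aksarallm20b-eu/pretrain/v1/manifest.json - \
  | jq '[.per_source_stats[] | .tokens] | add'
```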
## Detailed status (SSH into TPU)
```bash
gcloud compute tpus tpu-vm ssh aksara-20b-v6e-8 --zone=europe-west4-a --worker=0
```
Once inside:
```bash
# See tmux sessions
tmux ls
# → corpus: 2 windows (created ...)
# Attach to live view (Ctrl-b n = next window, Ctrl-b d = detach)
tmux attach -t corpus
# Or just tail logs without attaching
tail -f ~/corpus_build.log # producer
tail -f ~/corpus_upload.log # uploader
# Local tmpfs state
du -sh /dev/shm/corpus_work/
find /dev/shm/corpus_work -type f | head
```
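If you want a quick snapshot of the producer window without attaching, `tmux capture-pane` works too. The window name here assumes the `producer`/`uploader` naming used in the restart commands below; check `tmux list-windows -t corpus` for the actual names.

```bash
# Print the last 20 lines of the producer window without attaching
tmux capture-pane -pt corpus:producer | tail -20
```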
## Expected progress
| What | Now |
|---|---|
| Sources being processed | fineweb, fineweb2_id, culturax_id, culturax_ms, fineweb2_jv, fineweb2_su, culturax_jv, culturax_su, code_search_net, wikipedia_id, wikipedia_jv, wikipedia_en |
| Target per full run | 100B tokens (budget distributed by mix %) |
| Throughput | ~100k tokens/sec (CPU-bound on one VM) |
| First shard → GCS | already landed (see bucket) |
| Each new shard | ~6 min per 500 MB (~250M tokens) |
| Wall-clock for 100B | ~10–12 days single-threaded |
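The wall-clock estimate follows directly from the throughput figure; a quick sanity check:

```bash
# 100B tokens at ~100k tokens/sec, converted to days
echo "scale=2; 100 * 10^9 / (100 * 10^3) / 86400" | bc -l
# -> 11.57, i.e. the ~10-12 day estimate in the table above
```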
**If you want faster:** start more parallel producers on different source subsets; see "Scaling out" below.
## Key files on the TPU VM
| File | Purpose |
|---|---|
| `~/corpus_build_runner.sh` | Launches producer, auto-restarts on crash up to 20 times |
| `~/corpus_upload_loop.sh` | Wrapper; invokes `corpus_upload_loop.py` |
| `~/corpus_upload_loop.py` | Python+gcsfs mirror of tmpfs → GCS every 10 min; deletes local shards older than 20 min (behaviour sketched after this table) |
| `~/.aksara_env` | Holds `HF_TOKEN`; sourced by runner |
| `~/AksaraLLM/scripts/build_pretrain_corpus_v2.py` | The actual producer |
| `/dev/shm/corpus_work/` | Producer's output (RAM-backed 709 GB tmpfs) |
| `/home/ubuntu/corpus_build.log` | Producer log |
| `/home/ubuntu/corpus_upload.log` | Uploader log |
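For reference, the uploader's behaviour (mirror the tmpfs work dir to GCS every 10 minutes, then purge local shards older than 20 minutes) can be approximated with plain `gcloud storage rsync`. This is only a sketch to show what the loop does; the real loop is the Python+gcsfs script listed above.

```bash
#!/usr/bin/env bash
# Sketch of the uploader's behaviour, not the actual script:
# mirror tmpfs -> GCS every 10 min, then purge shards older than 20 min.
while true; do
  gcloud storage rsync --recursive \
    /dev/shm/corpus_work/ gs://aksarallm20b-eu/pretrain/v1/
  # Anything 20+ minutes old has been covered by at least one sync above
  find /dev/shm/corpus_work -name 'shard-*.parquet' -mmin +20 -delete
  sleep 600
done
```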
## Stop the build
```bash
# On TPU VM:
tmux kill-session -t corpus
# Any shards already uploaded stay in GCS. Local tmpfs shards not yet uploaded are lost on VM reboot.
```
## Restart the build (fresh)
```bash
# On TPU VM:
tmux kill-session -t corpus 2>/dev/null
rm -rf /dev/shm/corpus_work/*
tmux new-session -d -s corpus -n producer 'bash ~/corpus_build_runner.sh'
tmux new-window -t corpus -n uploader 'bash ~/corpus_upload_loop.sh'
```
Note: the producer deduplicates only within its in-memory state. A restart forgets previous near-duplicates, so each restart admits a small number of extra duplicates. Exact dedup (`sha256`) is reset too; cross-run dedup should be redone at consolidation time before the real pretrain run.
## Scaling out (speed up 10×)
The current producer is single-threaded per source and runs as a single process per VM. To produce 400B tokens in 2–3 weeks you need roughly 8–10 parallel producers. Easiest scheme: **partition sources across tmux windows**.
```bash
# Example: 3 parallel producers, each owning different sources
tmux kill-session -t corpus 2>/dev/null
# Producer 1: English (highest volume source)
tmux new-session -d -s corpus -n p1 \
'source ~/.aksara_env && cd ~/AksaraLLM && \
python3 -u scripts/build_pretrain_corpus_v2.py build \
--config configs/aksara_20b_dense.json \
--output-dir /dev/shm/corpus_work \
--assets-dir /dev/shm/pretrain_assets \
--target-total-tokens 400000000000 \
--shard-target-bytes 524288000 \
--no-decontam \
--sources fineweb,wikipedia_en 2>&1 | tee -a ~/corpus_p1.log'
# Producer 2: Indonesian bulk
tmux new-window -t corpus -n p2 \
'source ~/.aksara_env && cd ~/AksaraLLM && \
python3 -u scripts/build_pretrain_corpus_v2.py build \
--config configs/aksara_20b_dense.json \
--output-dir /dev/shm/corpus_work \
--assets-dir /dev/shm/pretrain_assets \
--target-total-tokens 400000000000 \
--shard-target-bytes 524288000 \
--no-decontam \
--sources fineweb2_id,culturax_id,wikipedia_id 2>&1 | tee -a ~/corpus_p2.log'
# Producer 3: Malay + JV + SU
tmux new-window -t corpus -n p3 \
'source ~/.aksara_env && cd ~/AksaraLLM && \
python3 -u scripts/build_pretrain_corpus_v2.py build \
--config configs/aksara_20b_dense.json \
--output-dir /dev/shm/corpus_work \
--assets-dir /dev/shm/pretrain_assets \
--target-total-tokens 400000000000 \
--shard-target-bytes 524288000 \
--no-decontam \
--sources culturax_ms,fineweb2_jv,fineweb2_su,culturax_jv,culturax_su,wikipedia_jv 2>&1 | tee -a ~/corpus_p3.log'
# Keep the uploader running
tmux new-window -t corpus -n uploader 'bash ~/corpus_upload_loop.sh'
```
⚠️ Each producer writes to `/dev/shm/corpus_work/{source}/shard-*.parquet`. Since no two producers share a source (if you partition correctly), they don't collide. The uploader mirrors everything to GCS.
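A quick way to confirm the partition above has no overlaps; it prints nothing when every source is owned by exactly one producer:

```bash
# Duplicate check over the three producers' --sources lists above
printf '%s\n' \
  fineweb wikipedia_en \
  fineweb2_id culturax_id wikipedia_id \
  culturax_ms fineweb2_jv fineweb2_su culturax_jv culturax_su wikipedia_jv |
  sort | uniq -d
```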
## Troubleshooting
### "bucket size: 0.00 GB" stays zero
The uploader is running but has nothing to sync because no shard has flushed yet. Shards flush every ~500 MB of text (~250M tokens), so wait 6–10 min after producer start.
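To confirm the producer is actually making progress while you wait, check tmpfs on the TPU VM; shard files show up there before anything reaches GCS:

```bash
# How much the producer has written locally, and how many shards exist so far
du -sh /dev/shm/corpus_work
find /dev/shm/corpus_work -name 'shard-*.parquet' | wc -l
```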
### Producer keeps restarting
Check `tail -50 ~/corpus_build.log` for the actual error. Common causes:
- Rate limiting from HF Hub → wait a few minutes; the runner retries automatically
- Dataset schema change → pin the `datasets` version in requirements
- OOM → `/dev/shm` filling up; the uploader should be purging old shards every 10 min, but if the uploader crashed, local files accumulate (see the checks below)
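For the OOM case in particular, check whether the uploader is still alive and how full the RAM-backed work dir is before restarting anything:

```bash
# Is the uploader still running?
tmux list-windows -t corpus
pgrep -af corpus_upload_loop || echo "uploader not running"
# How full is tmpfs?
df -h /dev/shm
du -sh /dev/shm/corpus_work
```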
### `gcloud storage du` shows it working, but manifest.json missing
The manifest is written after each source completes, so early in the run shards exist in the bucket without a manifest. `manifest.json` appears once the first source (fineweb or any other) finishes.
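To check whether the manifest has appeared yet without downloading it:

```bash
# Exits non-zero while the manifest is still missing
gcloud storage ls gs://aksarallm20b-eu/pretrain/v1/manifest.json
```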
### tmux session disappears after VM reboot
This is expected: tmux state doesn't survive reboots, and preemptible TPU nodes can reboot. To make the producer survive reboots, configure a systemd service (see "Auto-restart on reboot" below).
## Auto-restart on reboot (optional)
Create `/etc/systemd/system/aksara-corpus.service` on TPU VM:
```ini
[Unit]
Description=AksaraLLM corpus build producer
After=network-online.target
[Service]
Type=simple
User=ubuntu
ExecStart=/bin/bash /home/ubuntu/corpus_build_runner.sh
Restart=always
RestartSec=60
StandardOutput=append:/home/ubuntu/corpus_build.log
StandardError=append:/home/ubuntu/corpus_build.log
[Install]
WantedBy=multi-user.target
```
Then: `sudo systemctl daemon-reload && sudo systemctl enable --now aksara-corpus`.
Same pattern for `aksara-corpus-uploader.service`; a sketch follows.
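One way to set it up, reusing the script and log paths from the key-files table above (a sketch, not a tested unit):

```bash
# Sketch: write the matching uploader unit and enable it
sudo tee /etc/systemd/system/aksara-corpus-uploader.service >/dev/null <<'EOF'
[Unit]
Description=AksaraLLM corpus uploader
After=network-online.target

[Service]
Type=simple
User=ubuntu
ExecStart=/bin/bash /home/ubuntu/corpus_upload_loop.sh
Restart=always
RestartSec=60
StandardOutput=append:/home/ubuntu/corpus_upload.log
StandardError=append:/home/ubuntu/corpus_upload.log

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload && sudo systemctl enable --now aksara-corpus-uploader
```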