# AksaraLLM 20B – Monitoring Runbook
The corpus build is running in a **tmux session on `aksara-20b-v6e-8`** and will keep running even after this Devin session ends. Use this runbook to check progress, diagnose issues, or stop/restart the build.
## Quick status (from your laptop)
```bash
# 1. Check bucket growth (how much corpus is in GCS)
gcloud storage du -s gs://aksarallm20b-eu/pretrain/v1/
# Per-source breakdown
gcloud storage du -s \
gs://aksarallm20b-eu/pretrain/v1/fineweb \
gs://aksarallm20b-eu/pretrain/v1/fineweb2_id \
gs://aksarallm20b-eu/pretrain/v1/culturax_id \
gs://aksarallm20b-eu/pretrain/v1/culturax_ms \
gs://aksarallm20b-eu/pretrain/v1/fineweb2_jv \
gs://aksarallm20b-eu/pretrain/v1/fineweb2_su \
gs://aksarallm20b-eu/pretrain/v1/culturax_jv \
gs://aksarallm20b-eu/pretrain/v1/culturax_su \
gs://aksarallm20b-eu/pretrain/v1/code_search_net \
gs://aksarallm20b-eu/pretrain/v1/wikipedia_id \
gs://aksarallm20b-eu/pretrain/v1/wikipedia_jv \
gs://aksarallm20b-eu/pretrain/v1/wikipedia_en
# 2. Read the latest manifest
gcloud storage cp gs://aksarallm20b-eu/pretrain/v1/manifest.json - | jq '.per_source_stats'
```
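A couple of optional helpers built on the commands above (a sketch; it assumes `gcloud storage du` emits raw bytes, which is its default, and that uploaded shards follow the `shard-*.parquet` naming the producer uses locally):
```bash
# Bucket usage in GiB instead of raw bytes
gcloud storage du -s gs://aksarallm20b-eu/pretrain/v1/ \
  | awk '{printf "%.1f GiB\n", $1 / 1024 / 1024 / 1024}'
# Count uploaded shards per source
gcloud storage ls 'gs://aksarallm20b-eu/pretrain/v1/*/shard-*.parquet' \
  | awk -F/ '{print $(NF-1)}' | sort | uniq -c
```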
## Detailed status (SSH into TPU)
```bash
gcloud compute tpus tpu-vm ssh aksara-20b-v6e-8 --zone=europe-west4-a --worker=0
```
Once inside:
```bash
# See tmux sessions
tmux ls
# → corpus: 2 windows (created ...)
# Attach to live view (Ctrl-b n = next window, Ctrl-b d = detach)
tmux attach -t corpus
# Or just tail logs without attaching
tail -f ~/corpus_build.log # producer
tail -f ~/corpus_upload.log # uploader
# Local tmpfs state
du -sh /dev/shm/corpus_work/
find /dev/shm/corpus_work -type f | head
```
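If you'd rather poll than attach, a small loop like this prints tmpfs usage and the latest producer log line once a minute (optional; Ctrl-C to stop):
```bash
# Poll local progress every 60 s without attaching to tmux
while true; do
  date
  du -sh /dev/shm/corpus_work/
  tail -n 1 ~/corpus_build.log
  sleep 60
done
```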
## Expected progress
| What | Current status / estimate |
|---|---|
| Sources being processed | fineweb, fineweb2_id, culturax_id, culturax_ms, fineweb2_jv, fineweb2_su, culturax_jv, culturax_su, code_search_net, wikipedia_id, wikipedia_jv, wikipedia_en |
| Target per full run | 100B tokens (budget distributed by mix %) |
| Throughput | ~100k tokens/sec (CPU-bound on one VM) |
| First shard → GCS | already landed (see bucket) |
| Each new shard | ~6 min per 500 MB (~250M tokens) |
| Wall-clock for 100B | ~10–12 days single-threaded |
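The wall-clock estimate follows directly from the throughput figure; a quick back-of-envelope check:
```bash
# 100B tokens at ~100k tokens/sec
awk 'BEGIN { s = 100e9 / 100e3; printf "%.0f s  ~=  %.1f days\n", s, s / 86400 }'
# => 1000000 s  ~=  11.6 days
```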
**If you want faster:** start more parallel producers on different source subsets; see "Scaling out" below.
## Key files on the TPU VM
| File | Purpose |
|---|---|
| `~/corpus_build_runner.sh` | Launches producer, auto-restarts on crash up to 20 times |
| `~/corpus_upload_loop.sh` | Wrapper; invokes `corpus_upload_loop.py` |
| `~/corpus_upload_loop.py` | Python+gcsfs mirror of tmpfs → GCS every 10 min; deletes local shards older than 20 min |
| `~/.aksara_env` | Holds `HF_TOKEN`; sourced by runner |
| `~/AksaraLLM/scripts/build_pretrain_corpus_v2.py` | The actual producer |
| `/dev/shm/corpus_work/` | Producer's output (RAM-backed 709 GB tmpfs) |
| `/home/ubuntu/corpus_build.log` | Producer log |
| `/home/ubuntu/corpus_upload.log` | Uploader log |
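`corpus_upload_loop.py` itself is the source of truth, but its mirror-and-purge behaviour amounts to roughly the loop below (a bash sketch of the same idea, not the actual script; it assumes any file older than 20 min was already mirrored by a previous pass):
```bash
# Rough equivalent of the uploader: mirror tmpfs to GCS every 10 min,
# then drop local shards older than 20 min to keep /dev/shm from filling.
while true; do
  gcloud storage rsync --recursive /dev/shm/corpus_work/ gs://aksarallm20b-eu/pretrain/v1/
  find /dev/shm/corpus_work -type f -mmin +20 -delete
  sleep 600
done
```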
## Stop the build
```bash
# On TPU VM:
tmux kill-session -t corpus
# Any shards already uploaded stay in GCS. Local tmpfs shards not yet uploaded are lost on VM reboot.
```
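If you want those not-yet-uploaded shards, push them manually before killing the session (one extra mirror pass; note this may also copy a partially written shard):
```bash
# On TPU VM, before killing the session:
gcloud storage rsync --recursive /dev/shm/corpus_work/ gs://aksarallm20b-eu/pretrain/v1/
```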
## Restart the build (fresh)
```bash
# On TPU VM:
tmux kill-session -t corpus 2>/dev/null
rm -rf /dev/shm/corpus_work/*
tmux new-session -d -s corpus -n producer 'bash ~/corpus_build_runner.sh'
tmux new-window -t corpus -n uploader 'bash ~/corpus_upload_loop.sh'
```
Note: the producer deduplicates only within its in-memory state. A restart forgets previously seen near-duplicates, so each restart admits a small number of extra duplicates. Exact dedup (`sha256`) is reset too; cross-run dedup should be re-done at consolidation time, before the real pretrain run.
## Scaling out (speed up 10×)
The current producer is single-threaded per source and single-process per VM. To produce 400B tokens in 2–3 weeks you need ~8–10 parallel producers. Easiest scheme: **partition sources across tmux windows**.
```bash
# Example: 3 parallel producers, each owning different sources
tmux kill-session -t corpus 2>/dev/null
# Producer 1: English (highest volume source)
tmux new-session -d -s corpus -n p1 \
'source ~/.aksara_env && cd ~/AksaraLLM && \
python3 -u scripts/build_pretrain_corpus_v2.py build \
--config configs/aksara_20b_dense.json \
--output-dir /dev/shm/corpus_work \
--assets-dir /dev/shm/pretrain_assets \
--target-total-tokens 400000000000 \
--shard-target-bytes 524288000 \
--no-decontam \
--sources fineweb,wikipedia_en 2>&1 | tee -a ~/corpus_p1.log'
# Producer 2: Indonesian bulk
tmux new-window -t corpus -n p2 \
'source ~/.aksara_env && cd ~/AksaraLLM && \
python3 -u scripts/build_pretrain_corpus_v2.py build \
--config configs/aksara_20b_dense.json \
--output-dir /dev/shm/corpus_work \
--assets-dir /dev/shm/pretrain_assets \
--target-total-tokens 400000000000 \
--shard-target-bytes 524288000 \
--no-decontam \
--sources fineweb2_id,culturax_id,wikipedia_id 2>&1 | tee -a ~/corpus_p2.log'
# Producer 3: Malay + JV + SU
tmux new-window -t corpus -n p3 \
'source ~/.aksara_env && cd ~/AksaraLLM && \
python3 -u scripts/build_pretrain_corpus_v2.py build \
--config configs/aksara_20b_dense.json \
--output-dir /dev/shm/corpus_work \
--assets-dir /dev/shm/pretrain_assets \
--target-total-tokens 400000000000 \
--shard-target-bytes 524288000 \
--no-decontam \
--sources culturax_ms,fineweb2_jv,fineweb2_su,culturax_jv,culturax_su,wikipedia_jv 2>&1 | tee -a ~/corpus_p3.log'
# Keep the uploader running
tmux new-window -t corpus -n uploader 'bash ~/corpus_upload_loop.sh'
```
⚠️ Each producer writes to `/dev/shm/corpus_work/{source}/shard-*.parquet`. As long as no two producers share a source, they don't collide; the check below verifies the partition is disjoint. The uploader mirrors everything to GCS.
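A quick way to confirm the partition is disjoint before launching (the lists just restate the example above):
```bash
# Sanity check: no source appears in more than one producer's --sources list
p1="fineweb,wikipedia_en"
p2="fineweb2_id,culturax_id,wikipedia_id"
p3="culturax_ms,fineweb2_jv,fineweb2_su,culturax_jv,culturax_su,wikipedia_jv"
dupes=$(printf '%s\n' "$p1" "$p2" "$p3" | tr ',' '\n' | sort | uniq -d)
if [ -z "$dupes" ]; then echo "OK: partition is disjoint"; else echo "COLLISION: $dupes"; fi
```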
## Troubleshooting
### "bucket size: 0.00 GB" stays zero
The uploader is running but has nothing to sync because no shard has flushed yet. Shards flush every ~500 MB of text (~250M tokens), so wait 6–10 min after the producer starts.
### Producer keeps restarting
Check `tail -50 ~/corpus_build.log` for the actual error. Common causes:
- Rate limiting from HF Hub → wait a few minutes; it retries automatically
- Dataset schema change → pin the `datasets` version in requirements
- OOM → `/dev/shm` filling up; the uploader should purge old shards every 10 min, but if the uploader crashed, local files accumulate (quick check below)
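Quick health check for the OOM case:
```bash
# Is /dev/shm filling up, and is the uploader still purging old shards?
df -h /dev/shm
find /dev/shm/corpus_work -type f -mmin +30 | wc -l   # should stay near 0 while the uploader is healthy
tail -n 20 ~/corpus_upload.log
```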
### `gcloud storage du` shows data, but manifest.json is missing
The manifest is written after each source completes. Early in the run only shards exist, with no manifest; it appears once fineweb (or any other source) finishes.
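To check for the manifest directly instead of inferring from `du`:
```bash
# Lists the object if it exists; errors out if it hasn't been written yet
gcloud storage ls gs://aksarallm20b-eu/pretrain/v1/manifest.json
```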
### tmux session disappears after VM reboot
This is expected: tmux state doesn't survive reboots, and preemptible TPU nodes can reboot. To make the producer survive reboots, configure a systemd service (see "Auto-restart on reboot" below).
## Auto-restart on reboot (optional)
Create `/etc/systemd/system/aksara-corpus.service` on TPU VM:
```ini
[Unit]
Description=AksaraLLM corpus build producer
After=network-online.target
[Service]
Type=simple
User=ubuntu
ExecStart=/bin/bash /home/ubuntu/corpus_build_runner.sh
Restart=always
RestartSec=60
StandardOutput=append:/home/ubuntu/corpus_build.log
StandardError=append:/home/ubuntu/corpus_build.log
[Install]
WantedBy=multi-user.target
```
Then: `sudo systemctl daemon-reload && sudo systemctl enable --now aksara-corpus`.
Same pattern for `aksara-corpus-uploader.service`.
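For completeness, the uploader unit can be created the same way (a sketch mirroring the producer unit above; it reuses the `corpus_upload_loop.sh` and log paths from the key-files table):
```bash
# Create the uploader unit, then enable both services
sudo tee /etc/systemd/system/aksara-corpus-uploader.service > /dev/null <<'EOF'
[Unit]
Description=AksaraLLM corpus uploader
After=network-online.target
[Service]
Type=simple
User=ubuntu
ExecStart=/bin/bash /home/ubuntu/corpus_upload_loop.sh
Restart=always
RestartSec=60
StandardOutput=append:/home/ubuntu/corpus_upload.log
StandardError=append:/home/ubuntu/corpus_upload.log
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload && sudo systemctl enable --now aksara-corpus-uploader
```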