
AksaraLLM 20B – Monitoring Runbook

The corpus build is running in a tmux session on aksara-20b-v6e-8 and will keep running after this Devin session ends. Use this runbook to check progress, diagnose issues, or stop and restart the build.

Quick status (from your laptop)

# 1. Check bucket growth (how much corpus is in GCS)
gcloud storage du -s gs://aksarallm20b-eu/pretrain/v1/

# Per-source breakdown
gcloud storage du -s \
  gs://aksarallm20b-eu/pretrain/v1/fineweb \
  gs://aksarallm20b-eu/pretrain/v1/fineweb2_id \
  gs://aksarallm20b-eu/pretrain/v1/culturax_id \
  gs://aksarallm20b-eu/pretrain/v1/culturax_ms \
  gs://aksarallm20b-eu/pretrain/v1/fineweb2_jv \
  gs://aksarallm20b-eu/pretrain/v1/fineweb2_su \
  gs://aksarallm20b-eu/pretrain/v1/culturax_jv \
  gs://aksarallm20b-eu/pretrain/v1/culturax_su \
  gs://aksarallm20b-eu/pretrain/v1/code_search_net \
  gs://aksarallm20b-eu/pretrain/v1/wikipedia_id \
  gs://aksarallm20b-eu/pretrain/v1/wikipedia_jv \
  gs://aksarallm20b-eu/pretrain/v1/wikipedia_en

# 2. Read the latest manifest
gcloud storage cp gs://aksarallm20b-eu/pretrain/v1/manifest.json - | jq '.per_source_stats'
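
# Rough progress estimate from bucket size alone, using the documented
# ratio of ~250M tokens per 500 MB shard (~0.5 tokens/byte).
# Assumes `gcloud storage du -s` prints "<bytes> <path>"; adjust if yours differs.
gcloud storage du -s gs://aksarallm20b-eu/pretrain/v1/ \
  | awk '{printf "%.1f GB uploaded, ~%.1fB tokens\n", $1/1e9, $1*0.5/1e9}'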

Detailed status (SSH into TPU)

gcloud compute tpus tpu-vm ssh aksara-20b-v6e-8 --zone=europe-west4-a --worker=0

Once inside:

# See tmux sessions
tmux ls
# → corpus: 2 windows (created ...)

# Attach to live view (Ctrl-b n = next window, Ctrl-b d = detach)
tmux attach -t corpus

# Or just tail logs without attaching
tail -f ~/corpus_build.log      # producer
tail -f ~/corpus_upload.log     # uploader
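
# One-off snapshot of a window's last lines without attaching
# (window names "producer"/"uploader" match the restart commands below)
tmux capture-pane -pt corpus:producer | tail -20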

# Local tmpfs state
du -sh /dev/shm/corpus_work/
find /dev/shm/corpus_work -type f | head
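
# Count local shards per source; the path layout is
# /dev/shm/corpus_work/{source}/shard-*.parquet (see "Scaling out" below)
find /dev/shm/corpus_work -name 'shard-*.parquet' \
  | awk -F/ '{print $(NF-1)}' | sort | uniq -c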

Expected progress

What                       Now
Sources being processed    fineweb, fineweb2_id, culturax_id, culturax_ms, fineweb2_jv, fineweb2_su, culturax_jv, culturax_su, code_search_net, wikipedia_id, wikipedia_jv, wikipedia_en
Target per full run        100B tokens (budget distributed by mix %)
Throughput                 ~100k tokens/sec (CPU-bound on one VM)
First shard → GCS          already landed (see bucket)
Each new shard             6 min per 500 MB (250M tokens)
Wall-clock for 100B        ~10–12 days single-threaded
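
The wall-clock figure follows directly from the throughput:

# 100B tokens / 100k tokens/sec = 1e6 seconds ≈ 11.6 days
python3 -c 'print(100e9 / 100e3 / 86400, "days")'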

If you want faster: start more parallel producers on different source subsets β€” see "Scaling out" below.

Key files on the TPU VM

File                                              Purpose
~/corpus_build_runner.sh                          Launches the producer; auto-restarts on crash, up to 20 times
~/corpus_upload_loop.sh                           Wrapper that invokes corpus_upload_loop.py
~/corpus_upload_loop.py                           Python+gcsfs mirror of tmpfs → GCS every 10 min; deletes local shards older than 20 min
~/.aksara_env                                     Holds HF_TOKEN; sourced by the runner
~/AksaraLLM/scripts/build_pretrain_corpus_v2.py   The actual producer
/dev/shm/corpus_work/                             Producer output (RAM-backed 709 GB tmpfs)
/home/ubuntu/corpus_build.log                     Producer log
/home/ubuntu/corpus_upload.log                    Uploader log

Stop the build

# On TPU VM:
tmux kill-session -t corpus
# Any shards already uploaded stay in GCS. Local tmpfs shards not yet uploaded are lost on VM reboot.
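
If you want to keep the not-yet-uploaded shards, run one last manual sync before killing the session. This assumes the uploader mirrors corpus_work/ directly under pretrain/v1/, as the per-source paths above suggest:

# Optional: final flush of tmpfs shards to GCS before stopping
gcloud storage rsync --recursive /dev/shm/corpus_work gs://aksarallm20b-eu/pretrain/v1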

Restart the build (fresh)

# On TPU VM:
tmux kill-session -t corpus 2>/dev/null
rm -rf /dev/shm/corpus_work/*
tmux new-session -d -s corpus -n producer 'bash ~/corpus_build_runner.sh'
tmux new-window -t corpus -n uploader 'bash ~/corpus_upload_loop.sh'

Note: the producer deduplicates within its in-memory state only. A restart forgets previously seen near-duplicates, so each restart admits a small number of extra duplicates. Exact dedup (sha256) is reset too; cross-run dedup should be re-done at consolidation time, before the real pretrain run.
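
A minimal sketch of that consolidation-time exact dedup, assuming shards are parquet files with a non-null "text" column (hypothetical; match the real schema from build_pretrain_corpus_v2.py):

# dedup_consolidate.py – cross-run sha256 exact dedup sketch
import glob, hashlib
import pyarrow.parquet as pq

seen = set()
# Path is an example; point it at the consolidated copy of all shards
for path in sorted(glob.glob("corpus_work/**/shard-*.parquet", recursive=True)):
    table = pq.read_table(path)
    keep = []
    for i, text in enumerate(table.column("text").to_pylist()):
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        if digest not in seen:          # first time we see this exact document
            seen.add(digest)
            keep.append(i)
    if len(keep) < table.num_rows:      # rewrite shard without the duplicates
        pq.write_table(table.take(keep), path)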

Scaling out (speed up 10Γ—)

The current producer is single-threaded per source and runs as a single process per VM. To produce 400B tokens in 2–3 weeks you need roughly 8–10 parallel producers. The easiest scheme is to partition the sources across tmux windows.

# Example: 3 parallel producers, each owning different sources
tmux kill-session -t corpus 2>/dev/null

# Producer 1: English (highest volume source)
tmux new-session -d -s corpus -n p1 \
    'source ~/.aksara_env && cd ~/AksaraLLM && \
     python3 -u scripts/build_pretrain_corpus_v2.py build \
       --config configs/aksara_20b_dense.json \
       --output-dir /dev/shm/corpus_work \
       --assets-dir /dev/shm/pretrain_assets \
       --target-total-tokens 400000000000 \
       --shard-target-bytes 524288000 \
       --no-decontam \
       --sources fineweb,wikipedia_en 2>&1 | tee -a ~/corpus_p1.log'

# Producer 2: Indonesian bulk
tmux new-window -t corpus -n p2 \
    'source ~/.aksara_env && cd ~/AksaraLLM && \
     python3 -u scripts/build_pretrain_corpus_v2.py build \
       --config configs/aksara_20b_dense.json \
       --output-dir /dev/shm/corpus_work \
       --assets-dir /dev/shm/pretrain_assets \
       --target-total-tokens 400000000000 \
       --shard-target-bytes 524288000 \
       --no-decontam \
       --sources fineweb2_id,culturax_id,wikipedia_id 2>&1 | tee -a ~/corpus_p2.log'

# Producer 3: Malay + JV + SU
tmux new-window -t corpus -n p3 \
    'source ~/.aksara_env && cd ~/AksaraLLM && \
     python3 -u scripts/build_pretrain_corpus_v2.py build \
       --config configs/aksara_20b_dense.json \
       --output-dir /dev/shm/corpus_work \
       --assets-dir /dev/shm/pretrain_assets \
       --target-total-tokens 400000000000 \
       --shard-target-bytes 524288000 \
       --no-decontam \
       --sources culturax_ms,fineweb2_jv,fineweb2_su,culturax_jv,culturax_su,wikipedia_jv 2>&1 | tee -a ~/corpus_p3.log'

# Keep the uploader running
tmux new-window -t corpus -n uploader 'bash ~/corpus_upload_loop.sh'

⚠️ Each producer writes to /dev/shm/corpus_work/{source}/shard-*.parquet. As long as no two producers share a source, they don't collide; a quick overlap check is sketched below. The uploader mirrors everything to GCS.
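
Before launching, it is worth confirming the partition really is disjoint. This prints any source accidentally assigned to two producers (expected output: nothing):

# Duplicate-assignment check across the three producer source lists above
printf '%s\n' \
    fineweb wikipedia_en \
    fineweb2_id culturax_id wikipedia_id \
    culturax_ms fineweb2_jv fineweb2_su culturax_jv culturax_su wikipedia_jv \
  | sort | uniq -d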

Troubleshooting

"bucket size: 0.00 GB" stays zero

The uploader is running but has nothing to sync because no shard has flushed yet. Shards flush every 500 MB of text (250M tokens), so wait 6–10 min after the producer starts.

Producer keeps restarting

Check tail -50 ~/corpus_build.log for the actual error. Common causes:

  • Rate limiting from HF Hub → wait a few minutes; the runner retries automatically
  • Dataset schema change → pin the datasets version in requirements
  • OOM → /dev/shm filling up; the uploader should be purging old shards every 10 min, but if the uploader crashed, local files accumulate (see the quick check below)
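
# Quick liveness check: is the uploader still running, and how full is tmpfs?
pgrep -af corpus_upload_loop
df -h /dev/shm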

gcloud storage du shows data, but manifest.json is missing

The manifest is written after each source completes, so early in the run shards exist in the bucket without a manifest. It appears once fineweb (or any other source) finishes.
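
To check for it directly:

# Long-list the manifest; errors with "not found" until the first source completes
gcloud storage ls -l gs://aksarallm20b-eu/pretrain/v1/manifest.json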

tmux session disappears after VM reboot

This is expected: tmux state doesn't survive reboots, and preemptible TPU nodes can reboot at any time. To make the build survive reboots, configure a systemd service (see "Auto-restart on reboot" below).

Auto-restart on reboot (optional)

Create /etc/systemd/system/aksara-corpus.service on TPU VM:

[Unit]
Description=AksaraLLM corpus build producer
After=network-online.target

[Service]
Type=simple
User=ubuntu
ExecStart=/bin/bash /home/ubuntu/corpus_build_runner.sh
Restart=always
RestartSec=60
StandardOutput=append:/home/ubuntu/corpus_build.log
StandardError=append:/home/ubuntu/corpus_build.log

[Install]
WantedBy=multi-user.target

Then:

sudo systemctl daemon-reload && sudo systemctl enable --now aksara-corpus

The same pattern works for the uploader as aksara-corpus-uploader.service; a sketch follows.
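
For reference, a sketch of /etc/systemd/system/aksara-corpus-uploader.service, reusing the uploader paths from the key-files table above:

[Unit]
Description=AksaraLLM corpus uploader
After=network-online.target

[Service]
Type=simple
User=ubuntu
ExecStart=/bin/bash /home/ubuntu/corpus_upload_loop.sh
Restart=always
RestartSec=60
StandardOutput=append:/home/ubuntu/corpus_upload.log
StandardError=append:/home/ubuntu/corpus_upload.log

[Install]
WantedBy=multi-user.target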