# AksaraLLM 20B – Monitoring Runbook


The corpus build is running in a **tmux session on `aksara-20b-v6e-8`** and will keep running after this Devin session ends. Use this runbook to check progress, diagnose issues, or stop and restart the build.


## Quick status (from your laptop)


```bash
# 1. Check bucket growth (how much corpus is in GCS)
gcloud storage du -s gs://aksarallm20b-eu/pretrain/v1/

# Per-source breakdown
gcloud storage du -s \
  gs://aksarallm20b-eu/pretrain/v1/fineweb \
  gs://aksarallm20b-eu/pretrain/v1/fineweb2_id \
  gs://aksarallm20b-eu/pretrain/v1/culturax_id \
  gs://aksarallm20b-eu/pretrain/v1/culturax_ms \
  gs://aksarallm20b-eu/pretrain/v1/fineweb2_jv \
  gs://aksarallm20b-eu/pretrain/v1/fineweb2_su \
  gs://aksarallm20b-eu/pretrain/v1/culturax_jv \
  gs://aksarallm20b-eu/pretrain/v1/culturax_su \
  gs://aksarallm20b-eu/pretrain/v1/code_search_net \
  gs://aksarallm20b-eu/pretrain/v1/wikipedia_id \
  gs://aksarallm20b-eu/pretrain/v1/wikipedia_jv \
  gs://aksarallm20b-eu/pretrain/v1/wikipedia_en

# 2. Read the latest manifest
gcloud storage cp gs://aksarallm20b-eu/pretrain/v1/manifest.json - | jq '.per_source_stats'
```
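
For a rolling view from the laptop, the same `du` command can be wrapped in a simple polling loop (the 10-minute interval below is arbitrary):

```bash
# Print the total bucket size every 10 minutes until interrupted
while true; do
  date
  gcloud storage du -s gs://aksarallm20b-eu/pretrain/v1/
  sleep 600
done
```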


## Detailed status (SSH into TPU)


```bash
gcloud compute tpus tpu-vm ssh aksara-20b-v6e-8 --zone=europe-west4-a --worker=0
```


Once inside:


```bash
# See tmux sessions
tmux ls
# → corpus: 2 windows (created ...)

# Attach to live view (Ctrl-b n = next window, Ctrl-b d = detach)
tmux attach -t corpus

# Or just tail logs without attaching
tail -f ~/corpus_build.log   # producer
tail -f ~/corpus_upload.log  # uploader

# Local tmpfs state
du -sh /dev/shm/corpus_work/
find /dev/shm/corpus_work -type f | head
```
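
A quick liveness check without attaching to tmux; the process names below are taken from the key-files table further down:

```bash
# Each line should print a matching command line; if not, check the corresponding log
pgrep -af build_pretrain_corpus_v2.py || echo "producer not running"
pgrep -af corpus_upload_loop.py || echo "uploader not running"
```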


## Expected progress


| What | Now |
|---|---|
| Sources being processed | fineweb, fineweb2_id, culturax_id, culturax_ms, fineweb2_jv, fineweb2_su, culturax_jv, culturax_su, code_search_net, wikipedia_id, wikipedia_jv, wikipedia_en |
| Target per full run | 100B tokens (budget distributed by mix %) |
| Throughput | ~100k tokens/sec (CPU-bound on one VM) |
| First shard → GCS | already landed (see bucket) |
| Each new shard | ~6 min per 500 MB (~250M tokens) |
| Wall-clock for 100B | ~10–12 days single-threaded |
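
The wall-clock row follows directly from the throughput row; a quick sanity check, assuming the stated ~100k tokens/sec holds:

```bash
# 100B tokens at ~100k tokens/sec, converted to days
echo "scale=1; (100 * 10^9) / (100 * 10^3) / 86400" | bc   # ≈ 11.5
```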


**If you need it faster:** start more parallel producers on different source subsets – see "Scaling out" below.


## Key files on the TPU VM


| File | Purpose |
|---|---|
| `~/corpus_build_runner.sh` | Launches producer, auto-restarts on crash up to 20 times |
| `~/corpus_upload_loop.sh` | Wrapper; invokes `corpus_upload_loop.py` |
| `~/corpus_upload_loop.py` | Python+gcsfs mirror of tmpfs → GCS every 10 min, deletes local shards older than 20 min |
| `~/.aksara_env` | Holds `HF_TOKEN`; sourced by runner |
| `~/AksaraLLM/scripts/build_pretrain_corpus_v2.py` | The actual producer |
| `/dev/shm/corpus_work/` | Producer's output (RAM-backed 709 GB tmpfs) |
| `/home/ubuntu/corpus_build.log` | Producer log |
| `/home/ubuntu/corpus_upload.log` | Uploader log |


## Stop the build


```bash
# On TPU VM:
tmux kill-session -t corpus
# Any shards already uploaded stay in GCS. Local tmpfs shards not yet uploaded are lost on VM reboot.
```


## Restart the build (fresh)


```bash
# On TPU VM:
tmux kill-session -t corpus 2>/dev/null
rm -rf /dev/shm/corpus_work/*
tmux new-session -d -s corpus -n producer 'bash ~/corpus_build_runner.sh'
tmux new-window -t corpus -n uploader 'bash ~/corpus_upload_loop.sh'
```


Note: the producer deduplicates only within its in-memory state. A restart forgets previously seen near-duplicates, so each restart adds a small number of extra duplicates. Exact dedup (`sha256`) is reset too; cross-run dedup should be redone at consolidation time, before the real pretrain run.


## Scaling out (speed up 10×)


The current producer is single-threaded per source and single-process per VM. To produce 400B tokens in 2–3 weeks you need ~8–10 parallel producers. Easiest scheme: **partition sources across tmux windows**.


```bash
# Example: 3 parallel producers, each owning different sources
tmux kill-session -t corpus 2>/dev/null

# Producer 1: English (highest volume source)
tmux new-session -d -s corpus -n p1 \
  'source ~/.aksara_env && cd ~/AksaraLLM && \
   python3 -u scripts/build_pretrain_corpus_v2.py build \
     --config configs/aksara_20b_dense.json \
     --output-dir /dev/shm/corpus_work \
     --assets-dir /dev/shm/pretrain_assets \
     --target-total-tokens 400000000000 \
     --shard-target-bytes 524288000 \
     --no-decontam \
     --sources fineweb,wikipedia_en 2>&1 | tee -a ~/corpus_p1.log'

# Producer 2: Indonesian bulk
tmux new-window -t corpus -n p2 \
  'source ~/.aksara_env && cd ~/AksaraLLM && \
   python3 -u scripts/build_pretrain_corpus_v2.py build \
     --config configs/aksara_20b_dense.json \
     --output-dir /dev/shm/corpus_work \
     --assets-dir /dev/shm/pretrain_assets \
     --target-total-tokens 400000000000 \
     --shard-target-bytes 524288000 \
     --no-decontam \
     --sources fineweb2_id,culturax_id,wikipedia_id 2>&1 | tee -a ~/corpus_p2.log'

# Producer 3: Malay + JV + SU
tmux new-window -t corpus -n p3 \
  'source ~/.aksara_env && cd ~/AksaraLLM && \
   python3 -u scripts/build_pretrain_corpus_v2.py build \
     --config configs/aksara_20b_dense.json \
     --output-dir /dev/shm/corpus_work \
     --assets-dir /dev/shm/pretrain_assets \
     --target-total-tokens 400000000000 \
     --shard-target-bytes 524288000 \
     --no-decontam \
     --sources culturax_ms,fineweb2_jv,fineweb2_su,culturax_jv,culturax_su,wikipedia_jv 2>&1 | tee -a ~/corpus_p3.log'

# Keep the uploader running
tmux new-window -t corpus -n uploader 'bash ~/corpus_upload_loop.sh'
```


⚠️ Each producer writes to `/dev/shm/corpus_work/{source}/shard-*.parquet`. As long as no two producers own the same source (i.e. you partition correctly), they won't collide. The uploader mirrors everything to GCS.


## Troubleshooting


### "bucket size: 0.00 GB" stays zero
The uploader is running but has nothing to sync because no shard has flushed yet. Shards flush every ~500 MB of text (~250M tokens). Wait 6–10 min after the producer starts.


### Producer keeps restarting
Check `tail -50 ~/corpus_build.log` for the actual error. Common causes:
- Rate limiting from HF Hub – wait a few minutes; it retries automatically
- Dataset schema change – pin the `datasets` version in requirements
- OOM – `/dev/shm` filling up; the uploader should purge old shards every 10 min, but if the uploader crashed, local files accumulate (see the check below)
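
To tell the OOM case apart from the others, check how full the tmpfs is and whether the uploader is still purging:

```bash
# If /dev/shm is close to 100% used, the uploader has probably stopped cleaning up
df -h /dev/shm
tail -n 20 ~/corpus_upload.log
```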


### `gcloud storage du` shows data, but `manifest.json` is missing
The manifest is written after each source completes. Early in the run, shards exist without a manifest; it appears once the first source (fineweb or any other) finishes.
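
Until the manifest exists, you can still confirm shards are landing by listing a source prefix directly, e.g.:

```bash
# Highest-numbered fineweb shards in the bucket so far
gcloud storage ls gs://aksarallm20b-eu/pretrain/v1/fineweb/ | tail -n 5
```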


### tmux session disappears after VM reboot
This is expected – tmux state doesn't survive reboots, and preemptible TPU nodes can reboot. To make the producer survive reboots, configure a systemd service (see "Auto-restart on reboot" below).


## Auto-restart on reboot (optional)


Create `/etc/systemd/system/aksara-corpus.service` on TPU VM:


```ini
[Unit]
Description=AksaraLLM corpus build producer
After=network-online.target

[Service]
Type=simple
User=ubuntu
ExecStart=/bin/bash /home/ubuntu/corpus_build_runner.sh
Restart=always
RestartSec=60
StandardOutput=append:/home/ubuntu/corpus_build.log
StandardError=append:/home/ubuntu/corpus_build.log

[Install]
WantedBy=multi-user.target
```


Then: `sudo systemctl daemon-reload && sudo systemctl enable --now aksara-corpus`.


Same pattern for `aksara-corpus-uploader.service`.
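
A minimal sketch of that uploader unit, assuming the same user, script path, and log path as in the key-files table above (adjust if your layout differs):

```ini
[Unit]
Description=AksaraLLM corpus uploader
After=network-online.target

[Service]
Type=simple
User=ubuntu
ExecStart=/bin/bash /home/ubuntu/corpus_upload_loop.sh
Restart=always
RestartSec=60
StandardOutput=append:/home/ubuntu/corpus_upload.log
StandardError=append:/home/ubuntu/corpus_upload.log

[Install]
WantedBy=multi-user.target
```

Enable it the same way: `sudo systemctl daemon-reload && sudo systemctl enable --now aksara-corpus-uploader`.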