# TRC v5p-128 Application — AksaraLLM 20B

**Apply at:** https://sites.research.google/trc/about/ (click "Apply now"), or reply to your existing TRC onboarding email thread with the upgrade request.

**Recommended ask:** v5p-128 preemptible, 6 weeks, `europe-west4-a` (same zone as your current `aksara-20b-v6e-8`; keeps data locality with `gs://aksarallm20b-eu/`).

---

## Email body (copy-paste, edit the `[bracketed]` bits)

> **Subject:** TRC upgrade request — v5p-128 for AksaraLLM 20B Indonesian pretrain (currently on v6e-8)
>
> Hi TRC team,
>
> I'm currently using `aksara-20b-v6e-8` (europe-west4-a) under TRC and would like to request an upgrade to **v5p-128 (preemptible, 6 weeks, europe-west4-a)** for the pretrain phase of **AksaraLLM 20B**, a from-scratch Indonesian-first LLM. v6e-8 is sufficient for smoke tests and SFT but implies a 6–9 month wall-clock for a 20B pretrain on ~400–600B tokens, whereas v5p-128 lands that in 4–5 weeks at healthy MFU.
>
> **Project:** AksaraLLM 20B — a LLaMA-3-style decoder-only transformer (GQA 48q/8kv, RoPE θ=1M, SwiGLU, RMSNorm, tied embeddings) targeting Indonesian, Malay, Javanese, and Sundanese, with English and code as secondary. Dense 20.36B params, 8,192 train context extending to 131,072 at inference via YaRN.
>
> **Readiness evidence (already built on v6e-8):**
>
> - Tokenizer live at https://huggingface.co/Ezekiel999/aksara-tokenizer-20b — 131,072 BPE vocab; fertility id=1.357, en=1.280, ms=1.368, jv=1.657 (all below targets)
> - Pretrain runner (EasyDeL / JAX / Flax NNX, SPMD mesh, Orbax checkpointing, W&B) validated end-to-end on v6e-8: 20-step smoke test with loss decreasing 11.83→11.61 at ~39k tok/s on a 200M proxy model, corpus streamed from `gs://aksarallm20b-eu/smoke_parquet/`
> - Corpus build pipeline (FineWeb + FineWeb-2-id + CulturaX + Indo4B + Dolma + The-Stack-v2, with fastText LID, Gopher quality filters, MinHash-LSH dedup, and 13-gram decontamination against IndoMMLU/xCOPA/XNLI-id/TyDiQA-id/MMLU/HellaSwag/ARC/GSM8K) is in code; we will use v6e-8 to produce the 400–600B-token Parquet corpus under `gs://aksarallm20b-eu/pretrain/` while we wait for v5p
> - GCP project `aksarallm-tpu`, co-located EU bucket `gs://aksarallm20b-eu/` (12.16 GB sample corpus already uploaded)
> - Repository: https://github.com/cahyohackids/AksaraLLM (branch `devin/1776993538-20b-pipeline-fixes`)
>
> **Compute plan for v5p-128:**
>
> - Phase 1 pretrain: 200k steps × 2 Mi tokens/step = 419B tokens at 8k context, ~4.5 weeks wall-clock at ~45% MFU
> - Phase 2 YaRN context extension: 10k steps at 32k context, ~4 days
> - Eval + smoke SFT validation: 2 days
>
> **Recovery plan for preemption:** Orbax async sharded checkpoints every 500 steps (~1 h) to `gs://aksarallm20b-eu/ckpt/`, with automatic resume. Expected preemption cost: under 10% of wall-clock.
>
> **Open-source deliverables:** Apache-2.0 base weights, SFT+DPO variants, and a technical report under the Hugging Face `AksaraLLM/` org. This would be the first sizable from-scratch Indonesian 20B, explicitly covering the JV/SU/MS tails that are underrepresented in current multilingual models.
>
> Grateful for the v6e-8 access so far — the readiness work above was all done on it. Happy to share W&B run logs for the smoke test if useful.
>
> Thanks,
> [Your name]
> [Affiliation / lab / company]
> GitHub: https://github.com/cahyohackids
> Hugging Face: https://huggingface.co/AksaraLLM
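Before sending, it's worth sanity-checking the Phase 1 token budget quoted in the compute plan. A minimal Python check follows; the 256 × 8,192 batch split is an assumption for illustration (any global batch yielding 2 Mi tokens/step gives the same total), not something the plan specifies:

```python
# Sanity-check of Phase 1 in the compute plan above.
STEPS = 200_000
SEQ_LEN = 8_192      # train context, from the email
GLOBAL_BATCH = 256   # ASSUMPTION: one batch shape that yields 2 Mi tokens/step

tokens_per_step = GLOBAL_BATCH * SEQ_LEN  # 2,097,152 = 2 * 2**20 ("2 Mi")
total_tokens = STEPS * tokens_per_step    # 419,430,400,000

assert tokens_per_step == 2 * 2**20
print(f"{total_tokens / 1e9:.1f}B tokens")  # -> 419.4B, matching the plan's 419B
```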
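Phase 2 leans on YaRN for the 8,192 → 131,072 extension, a scale factor of s = 16 over the train context. If a reviewer asks what YaRN actually changes, the core is an NTK-by-parts remapping of the RoPE frequencies. Below is a simplified sketch using the YaRN paper's default ramp bounds; the runner's real implementation may differ, so treat it as illustrative only:

```python
import numpy as np

def yarn_inv_freq(head_dim: int, base: float = 1e6, orig_ctx: int = 8_192,
                  scale: float = 16.0, beta_fast: float = 32.0, beta_slow: float = 1.0):
    """Simplified NTK-by-parts RoPE remap (illustrative; not the repo's code).

    YaRN also applies an attention-temperature correction, roughly
    t = 0.1 * ln(scale) + 1, which is omitted here.
    """
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    # Full rotations each dimension completes within the original context.
    rotations = orig_ctx * inv_freq / (2 * np.pi)
    # ramp -> 1 for high-frequency dims (many rotations: left untouched),
    # ramp -> 0 for low-frequency dims (few rotations: interpolated by 1/scale).
    ramp = np.clip((rotations - beta_slow) / (beta_fast - beta_slow), 0.0, 1.0)
    return inv_freq * ramp + (inv_freq / scale) * (1.0 - ramp)

print(yarn_inv_freq(128)[:4])  # high-frequency bands are (nearly) unchanged
```

Because only the low-frequency bands are interpolated, YaRN typically needs just a short adaptation run, which is consistent with Phase 2 being only 10k steps.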
---

## Readiness packet (attach or link in the email)

| Artifact | Link / Location |
|---|---|
| Tokenizer | https://huggingface.co/Ezekiel999/aksara-tokenizer-20b |
| Architecture config | `configs/aksara_20b_dense.json` on branch |
| Pretrain runner | `scripts/train_20b_pretrain.py` on branch |
| Corpus builder | `scripts/build_pretrain_corpus_v2.py` on branch |
| Preflight gates | `scripts/preflight_20b.py` on branch |
| Execution plan | `docs/aksara_20b_execution_plan.md` on branch |
| Smoke-test log excerpt | `step=0 loss=11.83 tok/s=33k`, `step=10 loss=11.61 tok/s=40k`, clean exit |
| Current TPU | `aksara-20b-v6e-8`, europe-west4-a, READY |
| Bucket (co-located) | `gs://aksarallm20b-eu/` (12.16 GB sample corpus + tokenizer + smoke parquet) |

---

## Tips for approval

1. **Emphasize Indonesian-first + underrepresented SEA languages.** TRC is more likely to approve open-science projects serving underrepresented languages than yet another English LLM.
2. **Show the work is already ready to run.** You have the tokenizer, the runner, and a validated smoke test; the ask is scale-out, not research.
3. **Preemptible is easier to get approved than on-demand.** The runner already has resume logic (see the sketch below this list), so preemptions are recoverable.
4. **6 weeks is the honest ask.** Asking for 12 weeks will get declined or trimmed; 4 weeks is too tight to leave margin for preemptions and the YaRN phase.
5. **Co-locate in europe-west4-a.** You already have `aksara-20b-v6e-8` and `gs://aksarallm20b-eu/` there. Don't ask for us-east or us-central — the TRC team prefers not to spread one project across zones.
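On tip 3: the "automatic resume" claimed in the recovery plan is the standard Orbax `CheckpointManager` pattern, sketched below with toy stand-ins for the train state. The real runner holds an EasyDeL/Flax NNX state, and the exact Orbax API shifts a little between versions, so this is a sketch of the pattern, not the runner's code:

```python
import orbax.checkpoint as ocp

CKPT_DIR = "gs://aksarallm20b-eu/ckpt/"  # bucket from the recovery plan
TOTAL_STEPS = 200_000

# Toy stand-ins; the real runner uses an EasyDeL/Flax NNX train state.
def init_state():
    return {"params": 0.0}

def train_step(state, batch):
    return {"params": state["params"] + batch}

options = ocp.CheckpointManagerOptions(
    save_interval_steps=500,          # checkpoint every 500 steps, per the plan
    max_to_keep=3,                    # ASSUMPTION: retention is not stated in the plan
    enable_async_checkpointing=True,  # async saves, per the plan
)
mngr = ocp.CheckpointManager(CKPT_DIR, options=options)

state = init_state()
latest = mngr.latest_step()           # None on a fresh start
if latest is not None:                # preempted earlier: resume from the bucket
    state = mngr.restore(latest, args=ocp.args.StandardRestore(state))

for step in range((latest + 1) if latest is not None else 0, TOTAL_STEPS):
    state = train_step(state, batch=1.0)
    mngr.save(step, args=ocp.args.StandardSave(state))  # no-op except at the interval

mngr.wait_until_finished()            # block until in-flight async saves land
```

With 500-step async checkpoints, a preemption costs at most the work since the last kept checkpoint plus restart overhead, consistent with the email's under-10% estimate.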