TRC v5p-128 Application – AksaraLLM 20B
Apply at: https://sites.research.google/trc/about/ (click "Apply now"), or reply to your existing TRC onboarding email thread with the upgrade request.
Recommended ask: v5p-128 preemptible, 6 weeks, europe-west4-a (same zone as your current aksara-20b-v6e-8, which keeps data locality with gs://aksarallm20b-eu/).
Email body (copy-paste, edit the [bracketed] bits)
Subject: TRC upgrade request – v5p-128 for AksaraLLM 20B Indonesian pretrain (upgrading from v6e-8)
Hi TRC team,
I'm currently using aksara-20b-v6e-8 (europe-west4-a) under TRC and would like to request an upgrade to v5p-128 (preemptible, 6 weeks, europe-west4-a) for the pretrain phase of AksaraLLM 20B, a from-scratch Indonesian-first LLM. v6e-8 is sufficient for smoke tests and SFT, but it implies a 6–9 month wall-clock for a 20B pretrain on ~400–600B tokens, whereas v5p-128 lands that in 4–5 weeks at healthy MFU.

Project: AksaraLLM 20B – a LLaMA-3-style decoder-only transformer (GQA 48q/8kv, RoPE θ=1M, SwiGLU, RMSNorm, tied embeddings) targeting Indonesian, Malay, Javanese, and Sundanese, with English and code as secondary. Dense 20.36B params, 8,192 train context extending to 131,072 at inference via YaRN.
Readiness evidence (already built on v6e-8):
- Tokenizer live at https://huggingface.co/Ezekiel999/aksara-tokenizer-20b – 131,072 BPE vocab, fertility id=1.357, en=1.280, ms=1.368, jv=1.657 (all below targets)
- Pretrain runner (EasyDeL / JAX / Flax NNX, SPMD mesh, Orbax checkpointing, W&B) validated end-to-end on v6e-8: 20-step smoke test with loss decreasing from 11.83 to 11.61 at ~39k tok/s on a 200M proxy model, corpus streamed from gs://aksarallm20b-eu/smoke_parquet/
- Corpus build pipeline (FineWeb + FineWeb-2-id + CulturaX + Indo4B + Dolma + The-Stack-v2, with fastText LID, Gopher quality filters, MinHash-LSH dedup, 13-gram decontamination against IndoMMLU/xCOPA/XNLI-id/TyDiQA-id/MMLU/HellaSwag/ARC/GSM8K) is in code; we will use v6e-8 to produce the 400–600B-token Parquet corpus under gs://aksarallm20b-eu/pretrain/ while we wait for v5p
- GCP project aksarallm-tpu, co-located EU bucket gs://aksarallm20b-eu/ (12.16 GB sample corpus already uploaded)
- Repository: https://github.com/cahyohackids/AksaraLLM (branch devin/1776993538-20b-pipeline-fixes)

Compute plan for v5p-128:
- Phase 1 pretrain: 200k steps × 2 Mi tokens/step = 419B tokens at 8k context, ~4.5 weeks wall-clock at ~45% MFU
- Phase 2 YaRN context extension: 10k steps at 32k context, ~4 days
- Eval + smoke SFT validation: 2 days
Recovery plan for preemption: Orbax async sharded checkpoints every 500 steps (~1h) to gs://aksarallm20b-eu/ckpt/, with automatic resume. Expected preemption cost under 10% of wall-clock.

Open-source deliverables: Apache-2.0 base weights, SFT+DPO variants, and a technical report on the AksaraLLM Hugging Face org. This would be the first sizable from-scratch Indonesian 20B, explicitly covering the JV/SU/MS tails that are underrepresented in current multilingual models.

Grateful for the v6e-8 access so far – all of the readiness work above was done on it. Happy to share W&B run logs for the smoke test if useful.
Thanks,
[Your name]
[Affiliation / lab / company]
GitHub: https://github.com/cahyohackids
Hugging Face: https://huggingface.co/AksaraLLM
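If a reviewer asks how the fertility numbers in the email were measured, they are cheap to reproduce. A minimal sketch, assuming the tokenizer loads via transformers.AutoTokenizer; the whitespace-word denominator and the per-language sample files are assumptions, so exact values may differ from the measurement behind the quoted figures:

```python
# Minimal fertility check: mean subword tokens per whitespace word.
# The tokenizer repo is the real one from the email; the sample files
# below are hypothetical placeholders for held-out per-language text.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Ezekiel999/aksara-tokenizer-20b")

def fertility(texts: list[str]) -> float:
    n_tokens = sum(len(tok.tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / max(n_words, 1)

for lang in ["id", "en", "ms", "jv"]:
    with open(f"samples/{lang}.txt", encoding="utf-8") as f:  # hypothetical paths
        lines = [l for l in f.read().splitlines() if l.strip()]
    print(f"{lang}: fertility={fertility(lines):.3f}")
```

Lower is better; values near 1.3 for id/en mean the 131k vocab is spending tokens efficiently on the target languages.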
Readiness packet (attach or link in the email)
| Artifact | Link / Location |
|---|---|
| Tokenizer | https://huggingface.co/Ezekiel999/aksara-tokenizer-20b |
| Architecture config | configs/aksara_20b_dense.json on branch |
| Pretrain runner | scripts/train_20b_pretrain.py on branch |
| Corpus builder | scripts/build_pretrain_corpus_v2.py on branch |
| Preflight gates | scripts/preflight_20b.py on branch |
| Execution plan | docs/aksara_20b_execution_plan.md on branch |
| Smoke-test log excerpt | step=0 loss=11.83 tok/s=33k, step=10 loss=11.61 tok/s=40k, clean exit |
| Current TPU | aksara-20b-v6e-8, europe-west4-a, READY |
| Bucket (co-located) | gs://aksarallm20b-eu/ (12.16 GB sample corpus + tokenizer + smoke parquet) |
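Before sending, it is worth sanity-checking the compute plan's arithmetic with the standard ~6·N·D FLOPs estimate for dense decoder pretraining. A minimal sketch; the sustained-throughput constant is a placeholder assumption (aggregate peak × MFU for the slice), not a quoted v5p-128 spec:

```python
# Sanity-check the compute plan: token count and rough wall-clock.
# Uses the standard ~6*N*D FLOPs estimate for dense decoder pretraining.
N_PARAMS = 20.36e9           # dense 20B model from the email
STEPS = 200_000
TOKENS_PER_STEP = 2 * 2**20  # 2 Mi tokens/step, as in the plan

tokens = STEPS * TOKENS_PER_STEP
train_flops = 6 * N_PARAMS * tokens

# ASSUMPTION: placeholder aggregate sustained FLOP/s (peak * ~45% MFU);
# substitute the real v5p-128 figure before quoting a wall-clock.
SUSTAINED_FLOPS = 2.0e16

days = train_flops / SUSTAINED_FLOPS / 86_400
print(f"tokens: {tokens / 1e9:.0f}B")        # ~419B, matches the plan
print(f"train FLOPs: {train_flops:.2e}")     # ~5.1e22
print(f"wall-clock: {days:.1f} days at the assumed throughput")
```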
Tips for approval
- Emphasize Indonesian-first + underrepresented SEA languages. TRC is more likely to approve open-science projects serving underrepresented languages than yet-another-English-LLM.
- Show the work is already ready to run β you have the tokenizer, the runner, and a validated smoke test. The ask is scale-out, not research.
- Preemptible is easier to get approved than on-demand. The runner already has resume logic (see the Orbax sketch after this list), so preemption is low-risk.
- 6 weeks is the honest ask. Asking for 12 weeks will get declined or trimmed; 4 weeks is too tight to leave margin for preemptions and the YaRN phase.
- Co-locate with europe-west4-a. You already have aksara-20b-v6e-8 there, along with gs://aksarallm20b-eu/. Don't ask for us-east or us-central – the TRC team prefers not to spread one project across zones.
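The resume logic referenced in the preemptible tip follows the usual Orbax CheckpointManager pattern. A minimal sketch, assuming a recent orbax-checkpoint release (the args-style save/restore API) and a pytree train_state; init_train_state and train_step are hypothetical stand-ins for the runner's own functions:

```python
# Preempt-resume sketch with Orbax async sharded checkpoints to GCS.
# The API shown (ocp.args.StandardSave/StandardRestore) tracks recent
# orbax-checkpoint releases and may differ in older versions.
import orbax.checkpoint as ocp

CKPT_DIR = "gs://aksarallm20b-eu/ckpt/"  # bucket path from the email

mngr = ocp.CheckpointManager(
    CKPT_DIR,
    options=ocp.CheckpointManagerOptions(
        save_interval_steps=500,  # matches the "every 500 steps" plan
        max_to_keep=3,            # bound GCS usage for sharded 20B states
    ),
)

train_state = init_train_state()  # hypothetical: build the fresh pytree

# On (re)start after a preemption, resume from the newest checkpoint.
start_step = 0
latest = mngr.latest_step()
if latest is not None:
    train_state = mngr.restore(latest, args=ocp.args.StandardRestore(train_state))
    start_step = latest + 1

for step in range(start_step, 200_000):
    train_state = train_step(train_state)  # hypothetical update fn
    # save() is a no-op off-interval; saves run async in the background.
    mngr.save(step, args=ocp.args.StandardSave(train_state))
mngr.wait_until_finished()  # flush pending async saves before exit
```

With a 500-step interval at ~1h per interval, a preemption costs at most one interval of lost work plus restart time, which is where the "under 10% of wall-clock" estimate in the email comes from.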