# TRC v5p-128 Application: AksaraLLM 20B
**Apply at:** https://sites.research.google/trc/about/ (click "Apply now"), or reply to your existing TRC onboarding email thread with the upgrade request.
**Recommended ask:** v5p-128 preemptible, 6 weeks, `europe-west4-a` (same zone as your current `aksara-20b-v6e-8`, keeps data-locality with `gs://aksarallm20b-eu/`).
---
## Email body (copy-paste, edit the `[bracketed]` bits)
> **Subject:** TRC upgrade request: v5p-128 for AksaraLLM 20B Indonesian pretrain (currently on v6e-8)
>
> Hi TRC team,
>
> I'm currently using `aksara-20b-v6e-8` (europe-west4-a) under TRC and would like to request an upgrade to **v5p-128 (preemptible, 6 weeks, europe-west4-a)** for the pretrain phase of **AksaraLLM 20B**, a from-scratch Indonesian-first LLM. v6e-8 is sufficient for smoke tests and SFT, but a 20B pretrain on ~400–600B tokens would take 6–9 months of wall-clock on it, whereas v5p-128 should finish in 4–5 weeks at ~45% MFU.
>
> **Project:** AksaraLLM 20B, a LLaMA-3-style decoder-only transformer (GQA 48q/8kv, RoPE θ=1M, SwiGLU, RMSNorm, tied embeddings) targeting Indonesian, Malay, Javanese, and Sundanese, with English and code as secondary. Dense 20.36B params, 8,192-token training context, extended to 131,072 tokens at inference via YaRN.
>
> **Readiness evidence (already built on v6e-8):**
> - Tokenizer live at https://huggingface.co/Ezekiel999/aksara-tokenizer-20b: 131,072 BPE vocab, fertility id=1.357, en=1.280, ms=1.368, jv=1.657 (all below targets)
> - Pretrain runner (EasyDeL / JAX / Flax NNX, SPMD mesh, Orbax checkpointing, W&B) validated end-to-end on v6e-8: 20-step smoke test with loss decreasing 11.83→11.61 at ~39k tok/s on a 200M proxy model, corpus streamed from `gs://aksarallm20b-eu/smoke_parquet/`
> - Corpus build pipeline (FineWeb + FineWeb-2-id + CulturaX + Indo4B + Dolma + The-Stack-v2, with fastText LID, Gopher quality filters, MinHash-LSH dedup, 13-gram decontamination against IndoMMLU/xCOPA/XNLI-id/TyDiQA-id/MMLU/HellaSwag/ARC/GSM8K) is implemented; we will use v6e-8 to produce the 400–600B-token Parquet corpus under `gs://aksarallm20b-eu/pretrain/` while waiting for v5p.
> - GCP project `aksarallm-tpu`, co-located EU bucket `gs://aksarallm20b-eu/` (12.16 GB sample corpus already uploaded)
> - Repository: https://github.com/cahyohackids/AksaraLLM (branch `devin/1776993538-20b-pipeline-fixes`)
>
> **Compute plan for v5p-128:**
> - Phase 1 pretrain: 200k steps × 2 Mi tokens/step = 419B tokens at 8k context, ~4.5 weeks wall-clock at ~45% MFU
> - Phase 2 YaRN context extension: 10k steps at 32k context, ~4 days
> - Eval + smoke SFT validation: 2 days
>
> **Recovery plan for preemption:** Orbax async sharded checkpoints every 500 steps (∼2 h at the Phase 1 pace of 200k steps in ~4.5 weeks) to `gs://aksarallm20b-eu/ckpt/`, with automatic resume. Expected preemption cost is under 10% of wall-clock.
>
> **Open-source deliverables:** Apache-2.0 base weights, SFT and DPO variants, and a technical report, all published under the Hugging Face `AksaraLLM/` org. To our knowledge this would be the first from-scratch 20B-parameter Indonesian-first model, explicitly covering the Javanese, Sundanese, and Malay tails that are underrepresented in current multilingual models.
>
> Grateful for the v6e-8 access so far; the readiness work above was all done on it. Happy to share W&B run logs for the smoke test if useful.
>
> Thanks,
> [Your name]
> [Affiliation / lab / company]
> GitHub: https://github.com/cahyohackids
> Hugging Face: https://huggingface.co/AksaraLLM
---
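Before sending, the Phase 1 figures in the compute plan of the email body can be sanity-checked with a few lines of Python. This is a minimal sketch, not part of the repository: the constants are copied from the draft, the variable names are illustrative, and the `6 * params * tokens` training-FLOPs rule is the usual dense-transformer approximation rather than a measured number.

```python
# Sanity check for the Phase 1 numbers quoted in the email body.
# All names are illustrative; only the constants come from the draft.

STEPS = 200_000                     # Phase 1 optimizer steps
TOKENS_PER_STEP = 2 * 1024 * 1024   # "2 Mi tokens/step"
SEQ_LEN = 8_192                     # training context length
PARAMS = 20.36e9                    # dense parameter count
WALL_CLOCK_WEEKS = 4.5              # planned Phase 1 duration

total_tokens = STEPS * TOKENS_PER_STEP
sequences_per_step = TOKENS_PER_STEP // SEQ_LEN

wall_clock_s = WALL_CLOCK_WEEKS * 7 * 24 * 3600
seconds_per_step = wall_clock_s / STEPS
tokens_per_second = total_tokens / wall_clock_s

# Assumption: ~6 FLOPs per parameter per token (forward + backward),
# ignoring the attention term.
train_flops = 6 * PARAMS * total_tokens
sustained_flops_per_s = train_flops / wall_clock_s

print(f"total tokens         : {total_tokens / 1e9:.1f}B")
print(f"sequences per step   : {sequences_per_step}")
print(f"seconds per step     : {seconds_per_step:.2f}")
print(f"tokens per second    : {tokens_per_second:,.0f}")
print(f"sustained FLOP/s     : {sustained_flops_per_s:.3e}")
```

Dividing the sustained FLOP/s by the target MFU gives the aggregate peak the slice has to provide. Comparing that figure against the published peak for a v5p-128 slice is a quick way to confirm that the 4.5-week estimate holds before it goes into the email.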
## Readiness packet (attach or link in the email)
| Artifact | Link / Location |
|---|---|
| Tokenizer | https://huggingface.co/Ezekiel999/aksara-tokenizer-20b |
| Architecture config | `configs/aksara_20b_dense.json` on branch |
| Pretrain runner | `scripts/train_20b_pretrain.py` on branch |
| Corpus builder | `scripts/build_pretrain_corpus_v2.py` on branch |
| Preflight gates | `scripts/preflight_20b.py` on branch |
| Execution plan | `docs/aksara_20b_execution_plan.md` on branch |
| Smoke-test log excerpt | `step=0 loss=11.83 tok/s=33k`, `step=10 loss=11.61 tok/s=40k`, clean exit |
| Current TPU | `aksara-20b-v6e-8`, europe-west4-a, READY |
| Bucket (co-located) | `gs://aksarallm20b-eu/` (12.16 GB sample corpus + tokenizer + smoke parquet) |
---
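If a reviewer asks how the tokenizer fertility figures were obtained, a small helper like the one below can reproduce them. The sketch assumes fertility means average tokens per whitespace-separated word, which this draft does not spell out; `fertility` and the toy `char_pair_tokenize` are hypothetical names, and the toy tokenizer only stands in for the real one so that the block runs without downloads.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_tokens += len(tokenize(text))
        n_words += len(text.split())
    if n_words == 0:
        raise ValueError("no words in input")
    return n_tokens / n_words


def char_pair_tokenize(text: str) -> List[str]:
    """Toy stand-in tokenizer: split each word into 2-character pieces."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + 2] for i in range(0, len(word), 2))
    return pieces


sample = ["saya makan nasi", "selamat pagi"]
print(round(fertility(sample, char_pair_tokenize), 3))
```

With the real tokenizer, `tokenize` would wrap whatever encode call the tokenizer library exposes, run over a held-out sample for each language.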
## Tips for approval
1. **Emphasize Indonesian-first + underrepresented SEA languages.** TRC is more likely to approve open-science projects serving underrepresented languages than yet-another-English-LLM.
2. **Show that the work is ready to run.** You have the tokenizer, the runner, and a validated smoke test. The ask is scale-out, not research.
3. **Preemptible is easier to get approved than on-demand.** The runner already has resume logic, so preemption is tolerable.
4. **6 weeks is the honest ask.** Asking for 12 weeks will get declined or trimmed; 4 weeks is too tight to leave margin for preemptions and the YaRN phase.
5. **Co-locate with europe-west4-a.** You already have `aksara-20b-v6e-8` there and `gs://aksarallm20b-eu/`. Don't ask for us-east or us-central, because the TRC team prefers not to spread one project across zones.
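
The preemption margin behind tips 3 and 4 can be estimated with a simple model: each preemption loses, on average, half a checkpoint interval of work plus a fixed restart time. The sketch below is an estimate under stated assumptions, not a measurement; the function name, the preemption rates, and the 15-minute restart time are all hypothetical inputs to replace with observed values.

```python
def preemption_overhead(ckpt_interval_h: float,
                        restart_h: float,
                        preemptions_per_day: float) -> float:
    """Expected fraction of wall-clock lost to preemptions.

    Assumes preemptions arrive uniformly within a checkpoint interval,
    so the expected redone work is half an interval per preemption.
    """
    lost_h_per_day = preemptions_per_day * (ckpt_interval_h / 2 + restart_h)
    return lost_h_per_day / 24


# Hypothetical inputs: two checkpoint intervals, two preemption rates,
# and a 15-minute restart (re-provision + restore) per preemption.
for interval_h in (1.0, 2.0):
    for rate in (1.0, 2.0):
        frac = preemption_overhead(interval_h, restart_h=0.25,
                                   preemptions_per_day=rate)
        print(f"interval={interval_h:.0f}h preemptions/day={rate:.0f} -> {frac:.1%}")
```

Under these placeholder inputs the loss ranges from roughly 3% to 10% of wall-clock. More frequent preemptions or slower restarts push it higher, which is the argument for keeping margin in the 6-week ask.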