---
license: apache-2.0
language:
  - id
  - ms
  - jv
  - su
  - en
tags:
  - aksarallm
  - indonesian
  - llama
  - from-scratch
  - pretraining
library_name: transformers
pipeline_tag: text-generation
---

# AksaraLLM 20B (dense)

> **Status: architecture + tokenizer published. Weights are NOT YET trained.**
> This repository currently holds the architecture config and tokenizer. The
> from-scratch pretraining run is blocked on TRC v5p-128 approval; see
> [Roadmap](#roadmap) below.

AksaraLLM 20B is a **from-scratch, Indonesian-first** decoder-only transformer designed to serve Indonesian (`id`), Malay (`ms`), Javanese (`jv`), and Sundanese (`su`), with English (`en`) and source code as secondary languages.

## Architecture

| Field | Value |
|---|---|
| Family | LLaMA-3-style decoder-only transformer |
| Parameters | **20,359,673,856** (20.36 B, with tied embeddings) |
| Hidden size | 6,144 |
| FFN inner size | 20,480 (SwiGLU) |
| Layers | 42 |
| Attention heads | 48 query / 8 KV (GQA, 6:1) |
| Head dim | 128 |
| Vocab size | 131,072 (byte-level BPE) |
| Positional encoding | RoPE, θ = 1,000,000 |
| Context (pretrain) | 8,192 |
| Context (YaRN extension) | 32,768 |
| Context (inference target) | 131,072 |
| Norm | RMSNorm |
| Embeddings | tied |

The parameter count follows from the dimensions above; a sanity-check sketch is included at the end of this card.

## Tokenizer

The tokenizer is already published at [`Ezekiel999/aksara-tokenizer-20b`](https://huggingface.co/Ezekiel999/aksara-tokenizer-20b) and mirrored here.

**Fertility** (tokens/word on held-out samples; lower is better; a measurement sketch is included at the end of this card):

| Language | Source | Tokens/word | Target |
|---|---|---|---|
| English | FineWeb | 1.280 | ≤ 1.40 |
| Indonesian | Wikipedia | 1.357 | ≤ 1.60 |
| Indonesian | CulturaX web | 1.215 | ≤ 1.60 |
| Malay | Wikipedia | 1.368 | ≤ 1.60 |
| Javanese | Wikipedia | 1.657 | ≤ 1.80 |

## Roadmap

| Phase | Status | Compute | Target date |
|---|---|---|---|
| 1. Architecture + tokenizer | ✅ **Done** | CPU | 2026-04 |
| 2. Corpus build (400–600B tokens) | 🔄 In progress | v6e-8 | 2026-05 |
| 3. Pretraining phase 1 (8k context, 400B tokens) | ⏸ Blocked on TRC v5p-128 | v5p-128, 4–5 weeks | 2026-06 |
| 4. YaRN context extension (32k) | ⏳ Pending | v5p-128, ~4 days | 2026-07 |
| 5. SFT | ⏳ Pending | v5p-64 or v6e-8 | 2026-07 |
| 6. DPO / ORPO | ⏳ Pending | v5p-64 or v6e-8 | 2026-07 |
| 7. Eval + release (GGUF) | ⏳ Pending | CPU | 2026-08 |

## Usage (tokenizer only)

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Ezekiel999/aksara-tokenizer-20b")
print(tok("Halo AksaraLLM", add_special_tokens=False).input_ids)
```

Weights will be published here once pretraining completes.

## Citation

```bibtex
@misc{aksarallm2026,
  title  = {AksaraLLM 20B: A From-Scratch Indonesian-First Language Model},
  author = {AksaraLLM Team},
  year   = {2026},
  url    = {https://huggingface.co/AksaraLLM/AksaraLLM-20B}
}
```

## License

Apache-2.0. Pre-training data attribution will be documented with the final weights.
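
## Appendix: parameter-count sketch

A minimal sanity check on the figure in the [Architecture](#architecture) table, assuming standard LLaMA-style blocks (separate gate/up/down SwiGLU projections, no attention or MLP biases, one RMSNorm weight vector per norm, plus a final RMSNorm). This is an illustration derived from the table, not the actual training config.

```python
# Recompute the parameter count from the dimensions in the architecture table.
# Assumes LLaMA-style blocks: no biases, separate SwiGLU gate/up/down matrices.
hidden, ffn, layers = 6_144, 20_480, 42
q_heads, kv_heads, head_dim = 48, 8, 128
vocab = 131_072

embed = vocab * hidden                       # tied: one matrix for input + output
attn = (hidden * q_heads * head_dim          # q_proj
        + 2 * hidden * kv_heads * head_dim   # k_proj + v_proj (GQA, 8 KV heads)
        + q_heads * head_dim * hidden)       # o_proj
mlp = 3 * hidden * ffn                       # SwiGLU gate, up, down
norms = 2 * hidden                           # pre-attention + pre-MLP RMSNorm

total = layers * (attn + mlp + norms) + embed + hidden  # + final RMSNorm
print(f"{total:,}")  # 20,359,673,856 -> matches the table
```

Under these assumptions the count lands exactly on the published 20,359,673,856.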
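
## Appendix: fertility measurement sketch

The [Tokenizer](#tokenizer) table reports fertility as tokens per word on held-out samples. The sketch below shows one way to reproduce such a number with the published tokenizer, assuming whitespace word splitting; the exact held-out corpora and segmentation rules behind the table are not published here, so `sample` is a stand-in.

```python
# Estimate tokenizer fertility (tokens per whitespace-delimited word).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Ezekiel999/aksara-tokenizer-20b")

# Stand-in Indonesian sentence; real measurements use held-out corpus samples.
sample = "Kucing itu tidur di atas tikar sepanjang sore."
n_tokens = len(tok(sample, add_special_tokens=False).input_ids)
n_words = len(sample.split())
print(f"fertility = {n_tokens / n_words:.3f} tokens/word")
```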