---
license: apache-2.0
language:
- id
- ms
- jv
- su
- en
tags:
- aksarallm
- indonesian
- llama
- from-scratch
- pretraining
library_name: transformers
pipeline_tag: text-generation
---

# AksaraLLM 20B (dense)

> **Status: architecture + tokenizer published. Weights are NOT YET trained.**
> This repository currently holds the architecture config and tokenizer. The
> from-scratch pretraining run is blocked on TRC v5p-128 approval; see
> [Roadmap](#roadmap) below.

AksaraLLM 20B is a **from-scratch, Indonesian-first** decoder-only transformer
designed to serve Indonesian (`id`), Malay (`ms`), Javanese (`jv`), and
Sundanese (`su`), with English (`en`) and source code as secondary.

## Architecture

| Field | Value |
|---|---|
| Family | LLaMA-3-style decoder-only transformer |
| Parameters | **20,359,673,856** (20.36 B, with tied embeddings) |
| Hidden size | 6,144 |
| FFN inner size | 20,480 (SwiGLU) |
| Layers | 42 |
| Attention heads | 48 query / 8 KV (GQA, 6:1) |
| Head dim | 128 |
| Vocab size | 131,072 (byte-level BPE) |
| Positional encoding | RoPE, θ = 1,000,000 |
| Context (pretrain) | 8,192 |
| Context (YaRN extension) | 32,768 |
| Context (inference target) | 131,072 |
| Norm | RMSNorm |
| Embeddings | tied |
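
There is no `config.json` to load yet, so the table above is the spec. As a
cross-check, the sketch below restates it as a standard `transformers.LlamaConfig`
and re-derives the headline parameter count analytically (instantiating the model
would allocate ~20 B random weights, so we don't). Fields not listed in the table,
such as `rms_norm_eps`, are left at library defaults; the config published with
the weights will be authoritative.

```python
from transformers import LlamaConfig

# The Architecture table, expressed as a standard transformers Llama config.
config = LlamaConfig(
    vocab_size=131_072,
    hidden_size=6_144,
    intermediate_size=20_480,
    num_hidden_layers=42,
    num_attention_heads=48,          # head dim = 6144 / 48 = 128
    num_key_value_heads=8,           # GQA, 48:8 = 6:1
    max_position_embeddings=8_192,   # pretrain context; 32,768 after YaRN
    rope_theta=1_000_000.0,
    tie_word_embeddings=True,
)

# Re-derive the parameter count (no biases, tied embeddings).
d, f, n, v = 6_144, 20_480, 42, 131_072
kv = 8 * 128                # total width of the K/V projections
per_layer = (
    2 * d * d + 2 * d * kv  # attention: q/o plus k/v projections
    + 3 * d * f             # SwiGLU MLP: gate, up, down
    + 2 * d                 # two RMSNorm weight vectors
)
total = v * d + n * per_layer + d   # + final RMSNorm
print(f"{total:,}")  # -> 20,359,673,856, matching the table
```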

## Tokenizer

The tokenizer is already published at
[`Ezekiel999/aksara-tokenizer-20b`](https://huggingface.co/Ezekiel999/aksara-tokenizer-20b)
and mirrored here.

**Fertility** (average tokens per word, measured on held-out samples):

| Language | Source | Tokens/word | Target |
|---|---|---|---|
| English | FineWeb | 1.280 | ≤ 1.40 |
| Indonesian | Wikipedia | 1.357 | ≤ 1.60 |
| Indonesian | CulturaX web | 1.215 | ≤ 1.60 |
| Malay | Wikipedia | 1.368 | ≤ 1.60 |
| Javanese | Wikipedia | 1.657 | ≤ 1.80 |
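
The held-out evaluation script is not part of this repo, but the metric is
simple to reproduce. A minimal sketch, assuming whitespace word segmentation
and an illustrative sample sentence (the table above uses held-out corpus
samples, not this text):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Ezekiel999/aksara-tokenizer-20b")

def fertility(texts: list[str]) -> float:
    """Average number of subword tokens per whitespace-delimited word."""
    n_tokens = sum(len(tok(t, add_special_tokens=False).input_ids) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

print(f"{fertility(['Model bahasa untuk Indonesia, Malaysia, Jawa, dan Sunda.']):.3f}")
```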

## Roadmap

| Phase | Status | Compute | Target date |
|---|---|---|---|
| 1. Architecture + tokenizer | ✅ **Done** | CPU | 2026-04 |
| 2. Corpus build (400–600 B tokens) | 🔄 In progress | v6e-8 | 2026-05 |
| 3. Pretrain phase 1 (8k context, 400 B tokens) | ⏸ Blocked on TRC v5p-128 | v5p-128, 4–5 weeks | 2026-06 |
| 4. YaRN context extension (32k) | ⏳ Pending | v5p-128, ~4 days | 2026-07 |
| 5. SFT | ⏳ Pending | v5p-64 or v6e-8 | 2026-07 |
| 6. DPO / ORPO | ⏳ Pending | v5p-64 or v6e-8 | 2026-07 |
| 7. Eval + release (GGUF) | ⏳ Pending | CPU | 2026-08 |
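
Phase 4 stretches the RoPE context window from 8k to 32k with YaRN. A rough
sketch of the corresponding config delta, assuming the release uses
`transformers`' built-in YaRN RoPE scaling (an assumption, not a published
decision) and reusing `config` from the Architecture sketch above:

```python
# Hypothetical phase-4 change: extend the 8k-pretrained RoPE to 32k via YaRN.
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,  # 32,768 / 8,192
    "original_max_position_embeddings": 8_192,
}
config.max_position_embeddings = 32_768
```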

## Usage (tokenizer only)

```python
from transformers import AutoTokenizer

# Only the tokenizer is usable today; the 20B weights are not yet trained.
tok = AutoTokenizer.from_pretrained("Ezekiel999/aksara-tokenizer-20b")
print(tok("Halo AksaraLLM", add_special_tokens=False).input_ids)
```

Weights will be published here once pretraining completes.

## Citation

```bibtex
@misc{aksarallm2026,
  title  = {AksaraLLM 20B: A From-Scratch Indonesian-First Language Model},
  author = {AksaraLLM Team},
  year   = {2026},
  url    = {https://huggingface.co/AksaraLLM/AksaraLLM-20B}
}
```

## License

Apache-2.0. Pretraining data attribution will be documented with the final weights.