---
license: apache-2.0
language:
- id
- ms
- jv
- su
- en
tags:
- aksarallm
- indonesian
- llama
- from-scratch
- pretraining
library_name: transformers
pipeline_tag: text-generation
---
# AksaraLLM 20B (dense)
> **Status: architecture + tokenizer published. Weights are NOT YET trained.**
> This repository currently holds the architecture config and tokenizer. The
> from-scratch pretraining run is blocked on TRC v5p-128 approval; see
> [Roadmap](#roadmap) below.
AksaraLLM 20B is a **from-scratch, Indonesian-first** decoder-only transformer
designed to serve Indonesian (`id`), Malay (`ms`), Javanese (`jv`), and
Sundanese (`su`), with English (`en`) and source code as secondary domains.
## Architecture
| Field | Value |
|---|---|
| Family | LLaMA-3-style decoder-only transformer |
| Parameters | **20,359,673,856** (20.36 B, with tied embeddings) |
| Hidden size | 6,144 |
| FFN inner | 20,480 (SwiGLU) |
| Layers | 42 |
| Attention heads | 48 query / 8 KV (GQA, 6:1) |
| Head dim | 128 |
| Vocab | 131,072 (byte-level BPE) |
| Positional | RoPE, θ = 1,000,000 |
| Context (pretrain) | 8,192 |
| Context (YaRN extend) | 32,768 |
| Context (inference target) | 131,072 |
| Norm | RMSNorm |
| Embeddings | tied |
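The headline parameter count follows directly from the table. A minimal sketch of the arithmetic (assuming standard LLaMA-style projection shapes and two RMSNorms per layer plus a final norm, which is what makes the total come out exactly):

```python
# Sanity-check the parameter count implied by the architecture table.
hidden = 6_144
layers = 42
ffn = 20_480
vocab = 131_072
n_q_heads, n_kv_heads, head_dim = 48, 8, 128

embed = vocab * hidden                            # tied: counted once
attn = hidden * (n_q_heads * head_dim)            # Q projection
attn += 2 * hidden * (n_kv_heads * head_dim)      # K, V projections (GQA)
attn += (n_q_heads * head_dim) * hidden           # O projection
mlp = 3 * hidden * ffn                            # gate, up, down (SwiGLU)
norms = 2 * hidden                                # pre-attn + pre-MLP RMSNorm

total = layers * (attn + mlp + norms) + hidden + embed  # "+ hidden" = final norm
print(f"{total:,}")  # 20,359,673,856
```

Note that GQA makes the K/V projections 6× smaller than Q (1,024 vs. 6,144 output dims), and tying the embeddings saves a full 805M-parameter LM head.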
## Tokenizer
The tokenizer is already published at
[`Ezekiel999/aksara-tokenizer-20b`](https://huggingface.co/Ezekiel999/aksara-tokenizer-20b)
and mirrored here.
**Fertility** (held-out samples):
| Language | Source | tokens/word | Target |
|---|---|---|---|
| English | FineWeb | 1.280 | ≤ 1.40 |
| Indonesian | Wikipedia | 1.357 | ≤ 1.60 |
| Indonesian | CulturaX web | 1.215 | ≤ 1.60 |
| Malay | Wikipedia | 1.368 | ≤ 1.60 |
| Javanese | Wikipedia | 1.657 | ≤ 1.80 |
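Fertility here is the conventional ratio of subword tokens to whitespace-separated words over a corpus (lower is better; the exact evaluation harness is not published, so this is a sketch of the standard definition, with a toy fixed-width tokenizer standing in for the real one):

```python
def fertility(tokenize, texts):
    """Corpus-level fertility: total subword tokens / total whitespace words."""
    tokens = sum(len(tokenize(t)) for t in texts)
    words = sum(len(t.split()) for t in texts)
    return tokens / words

# Toy tokenizer (4-char chunks) purely for illustration; the real measurement
# would pass `AutoTokenizer(...).tokenize` and held-out text per language.
toy = lambda t: [t[i:i + 4] for i in range(0, len(t), 4)]
print(fertility(toy, ["abcdefgh ij"]))  # 3 tokens / 2 words = 1.5
```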
## Roadmap
| Phase | Status | Compute | Target date |
|---|---|---|---|
| 1. Architecture + tokenizer | ✅ **Done** | CPU | 2026-04 |
| 2. Corpus build (400–600B tokens) | 🔄 in progress | v6e-8 | 2026-05 |
| 3. Pretrain phase 1 (8k context, 400B tokens) | ⏸ blocked on TRC v5p-128 | v5p-128, 4–5 weeks | 2026-06 |
| 4. YaRN context extension (32k) | pending | v5p-128, ~4 days | 2026-07 |
| 5. SFT | pending | v5p-64 or v6e-8 | 2026-07 |
| 6. DPO / ORPO | pending | v5p-64 or v6e-8 | 2026-07 |
| 7. Eval + release (GGUF) | pending | CPU | 2026-08 |
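The phase-4 context extension (8k → 32k) implies a YaRN scaling factor of 4. The exact extension recipe is not yet published; a hypothetical `rope_scaling` entry in the transformers-config style would look like:

```python
# Hypothetical rope_scaling config for the planned 8k -> 32k YaRN extension.
# Field names follow the transformers Llama-style convention; the actual
# training-time values may differ.
pretrain_ctx, extended_ctx = 8_192, 32_768

rope_scaling = {
    "rope_type": "yarn",
    "factor": extended_ctx / pretrain_ctx,          # 4.0
    "original_max_position_embeddings": pretrain_ctx,
}
print(rope_scaling["factor"])  # 4.0
```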
## Usage (tokenizer only)
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Ezekiel999/aksara-tokenizer-20b")
# Encode without BOS/EOS to inspect the raw subword ids.
print(tok("Halo AksaraLLM", add_special_tokens=False).input_ids)
```
Weights will be published here once pretraining completes.
## Citation
```bibtex
@misc{aksarallm2026,
  title  = {AksaraLLM 20B: A From-Scratch Indonesian-First Language Model},
  author = {AksaraLLM Team},
  year   = {2026},
  url    = {https://huggingface.co/AksaraLLM/AksaraLLM-20B}
}
```
## License
Apache-2.0. Pre-training data attribution will be documented with the final weights.