AksaraLLM 20B (dense)

Status: architecture and tokenizer are published; weights are NOT YET trained. This repository currently holds only the architecture config and the tokenizer. The from-scratch pretraining run is blocked on TRC (TPU Research Cloud) v5p-128 approval; see the Roadmap below.

AksaraLLM 20B is a from-scratch, Indonesian-first, decoder-only transformer designed to serve Indonesian (id), Malay (ms), Javanese (jv), and Sundanese (su) as primary languages, with English (en) and source code as secondary.

Architecture

| Field | Value |
|---|---|
| Family | LLaMA-3-style decoder-only transformer |
| Parameters | 20,359,673,856 (20.36 B, with tied embeddings) |
| Hidden size | 6,144 |
| FFN inner size | 20,480 (SwiGLU) |
| Layers | 42 |
| Attention heads | 48 query / 8 KV (GQA, 6:1) |
| Head dim | 128 |
| Vocab size | 131,072 (byte-level BPE) |
| Positional encoding | RoPE, θ = 1,000,000 |
| Context (pretrain) | 8,192 |
| Context (YaRN extension) | 32,768 |
| Context (inference target) | 131,072 |
| Norm | RMSNorm |
| Embeddings | Tied |
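
The table maps one-to-one onto a Hugging Face LlamaConfig. A minimal sketch is below; field names follow the transformers LLaMA implementation, the head dim of 128 falls out of 6,144 / 48, and the config file in this repository remains the source of truth.

from transformers import LlamaConfig

config = LlamaConfig(
    vocab_size=131_072,
    hidden_size=6_144,
    intermediate_size=20_480,        # SwiGLU inner dimension
    num_hidden_layers=42,
    num_attention_heads=48,
    num_key_value_heads=8,           # GQA: 6 query heads per KV head
    max_position_embeddings=8_192,   # pretraining context; YaRN extends this later
    rope_theta=1_000_000.0,
    tie_word_embeddings=True,
)

The headline parameter count can also be sanity-checked by hand. Assuming the standard LLaMA layer layout (bias-free Q/K/V/O projections, a three-matrix SwiGLU MLP, two RMSNorm weight vectors per layer, one final norm, and the tied embedding matrix counted once), the arithmetic reproduces the table's figure exactly:

V, d, f, L = 131_072, 6_144, 20_480, 42  # vocab, hidden, FFN inner, layers
kv = 8 * 128                             # total KV projection width (8 heads x 128)
attn = d * d + 2 * d * kv + d * d        # Q, K, V, O projection matrices
mlp = 3 * d * f                          # gate, up, down matrices (SwiGLU)
per_layer = attn + mlp + 2 * d           # plus two RMSNorm weight vectors
total = V * d + L * per_layer + d        # tied embeddings once + final norm
print(f"{total:,}")                      # 20,359,673,856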

Tokenizer

The tokenizer is already published at Ezekiel999/aksara-tokenizer-20b and mirrored here.

Fertility (held-out samples):

| Language | Source | Tokens/word | Target |
|---|---|---|---|
| English | FineWeb | 1.280 | ≤ 1.40 |
| Indonesian | Wikipedia | 1.357 | ≤ 1.60 |
| Indonesian | CulturaX (web) | 1.215 | ≤ 1.60 |
| Malay | Wikipedia | 1.368 | ≤ 1.60 |
| Javanese | Wikipedia | 1.657 | ≤ 1.80 |
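
Fertility here means subword tokens per whitespace-delimited word, so lower is better. The exact held-out evaluation protocol is not published, but a minimal sketch of the computation looks like this (the sample sentence is illustrative only):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Ezekiel999/aksara-tokenizer-20b")

def fertility(texts):
    # Subword tokens per whitespace-delimited word, pooled over all samples.
    n_tokens = sum(len(tok(t, add_special_tokens=False).input_ids) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

print(f"{fertility(['Halo dunia, apa kabar?']):.3f}")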

Roadmap

| Phase | Status | Compute | Target date |
|---|---|---|---|
| 1. Architecture + tokenizer | Done | CPU | 2026-04 |
| 2. Corpus build (400–600B tokens) | 🔄 In progress | v6e-8 | 2026-05 |
| 3. Pretrain phase 1 (8k context, 400B tokens) | ⏸ Blocked on TRC v5p-128 | v5p-128, 4–5 weeks | 2026-06 |
| 4. YaRN context extension (32k) | Pending | v5p-128, ~4 days | 2026-07 |
| 5. SFT | Pending | v5p-64 or v6e-8 | 2026-07 |
| 6. DPO / ORPO | Pending | v5p-64 or v6e-8 | 2026-07 |
| 7. Eval + release (GGUF) | Pending | CPU | 2026-08 |
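
Phase 4's 8,192 → 32,768 extension corresponds to a 4× RoPE scaling factor. As a rough sketch, and assuming the transformers rope_scaling schema for YaRN (the actual extension hyperparameters have not been published), the extended config would look roughly like:

from transformers import LlamaConfig

# Hypothetical YaRN settings for the planned 8,192 -> 32,768 extension.
# Only the 4x factor follows from the roadmap; nothing else is published.
config = LlamaConfig(
    max_position_embeddings=32_768,
    rope_theta=1_000_000.0,
    rope_scaling={
        "rope_type": "yarn",
        "factor": 4.0,
        "original_max_position_embeddings": 8_192,
    },
)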

Usage (tokenizer only)

from transformers import AutoTokenizer

# Load the published tokenizer from the Hub (no model weights exist yet).
tok = AutoTokenizer.from_pretrained("Ezekiel999/aksara-tokenizer-20b")
print(tok("Halo AksaraLLM", add_special_tokens=False).input_ids)

Weights will be published here once pretraining completes.

Citation

@misc{aksarallm2026,
  title = {AksaraLLM 20B: A From-Scratch Indonesian-First Language Model},
  author = {AksaraLLM Team},
  year = {2026},
  url = {https://huggingface.co/AksaraLLM/AksaraLLM-20B}
}

License

Apache-2.0. Pretraining data attribution will be documented with the final weights.
