# AksaraLLM 20B (dense)
**Status:** architecture and tokenizer published; weights are **not yet trained**. This repository currently holds the architecture config and tokenizer. The from-scratch pretraining run is blocked on TRC v5p-128 approval; see the Roadmap below.
AksaraLLM 20B is a from-scratch, Indonesian-first, decoder-only transformer designed to serve Indonesian (id), Malay (ms), Javanese (jv), and Sundanese (su), with English (en) and source code as secondary.
## Architecture
| Field | Value |
|---|---|
| Family | LLaMA-3-style decoder-only transformer |
| Parameters | 20,359,673,856 (20.36 B, with tied embeddings) |
| Hidden size | 6,144 |
| FFN inner | 20,480 (SwiGLU) |
| Layers | 42 |
| Attention heads | 48 query / 8 KV (GQA, 6:1) |
| Head dim | 128 |
| Vocab | 131,072 (BPE byte-level) |
| Positional | RoPE, θ = 1,000,000 |
| Context (pretrain) | 8,192 |
| Context (YaRN extend) | 32,768 |
| Context (inference target) | 131,072 |
| Norm | RMSNorm |
| Embeddings | tied |
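
For orientation, here is a minimal sketch of how the table above could map onto a Hugging Face `LlamaConfig`. The exact config class used by this repository and any value not listed in the table (e.g. `rms_norm_eps`) are assumptions, not the published config.

```python
from transformers import LlamaConfig

# Sketch of the architecture table as a LLaMA-style config.
# Values not in the table (rms_norm_eps) are assumptions.
config = LlamaConfig(
    vocab_size=131072,
    hidden_size=6144,
    intermediate_size=20480,        # SwiGLU FFN inner size
    num_hidden_layers=42,
    num_attention_heads=48,         # query heads; head dim = 6144 / 48 = 128
    num_key_value_heads=8,          # GQA, 6:1 query:KV ratio
    max_position_embeddings=8192,   # pretraining context
    rope_theta=1_000_000.0,
    rms_norm_eps=1e-5,              # assumption
    tie_word_embeddings=True,
)

print(config)
```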
## Tokenizer
The tokenizer is already published at `Ezekiel999/aksara-tokenizer-20b` and mirrored in this repository.
Fertility (average tokens per word, measured on held-out samples):
| Language | Source | tokens/word | Target |
|---|---|---|---|
| English | FineWeb | 1.280 | ≤ 1.40 |
| Indonesian | Wikipedia | 1.357 | ≤ 1.60 |
| Indonesian | CulturaX web | 1.215 | ≤ 1.60 |
| Malay | Wikipedia | 1.368 | ≤ 1.60 |
| Javanese | Wikipedia | 1.657 | ≤ 1.80 |
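
The numbers above can be read as average tokens per whitespace-separated word. A minimal sketch of how such a measurement might be reproduced is shown below; the sample text is a placeholder, not the held-out corpus used for the table.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Ezekiel999/aksara-tokenizer-20b")

def fertility(texts):
    """Average number of tokens per whitespace-separated word."""
    n_tokens = sum(len(tok(t, add_special_tokens=False).input_ids) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

# Placeholder sample; the table uses held-out FineWeb / Wikipedia / CulturaX text.
sample = ["Kecerdasan buatan berkembang sangat pesat di Indonesia."]
print(round(fertility(sample), 3))
```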
## Roadmap
| Phase | Status | Compute | Target date |
|---|---|---|---|
| 1. Architecture + tokenizer | ✅ Done | CPU | 2026-04 |
| 2. Corpus build (400–600B tokens) | 🔄 In progress | v6e-8 | 2026-05 |
| 3. Pretrain phase 1 (8k context, 400B tokens) | ⏸ Blocked on TRC v5p-128 | v5p-128, 4–5 weeks | 2026-06 |
| 4. YaRN context extension (32k) | ⏳ Pending | v5p-128, ~4 days | 2026-07 |
| 5. SFT | ⏳ Pending | v5p-64 or v6e-8 | 2026-07 |
| 6. DPO / ORPO | ⏳ Pending | v5p-64 or v6e-8 | 2026-07 |
| 7. Eval + release (GGUF) | ⏳ Pending | CPU | 2026-08 |
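
For phase 4, here is a hedged sketch of what the YaRN extension could look like expressed as a transformers-style `rope_scaling` entry. Only the 8k → 32k context targets come from the tables above; the remaining settings are assumptions, not the project's published configuration.

```python
# Hypothetical rope_scaling fragment for the 8k -> 32k YaRN extension.
# The factor follows from the context targets above (32768 / 8192 = 4);
# everything else is an assumption.
rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 8192,
}
```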
## Usage (tokenizer only)
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Ezekiel999/aksara-tokenizer-20b")
print(tok("Halo AksaraLLM", add_special_tokens=False).input_ids)
```
Weights will be published here once pretraining completes.
## Citation
```bibtex
@misc{aksarallm2026,
  title  = {AksaraLLM 20B: A From-Scratch Indonesian-First Language Model},
  author = {AksaraLLM Team},
  year   = {2026},
  url    = {https://huggingface.co/AksaraLLM/AksaraLLM-20B}
}
```
## License
Apache-2.0. Pre-training data attribution will be documented with the final weights.