---
language:
- id
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- indonesian
- aksarallm
- archived
- research
---
# Kiel-59M-Matured

> ⚠️ **Status: early experiment.** This 85M-parameter decoder-only transformer was trained from scratch as part of the early AksaraLLM line. It uses the GPT-2 BPE tokenizer (50,257-token vocabulary), which is not well suited to Indonesian, and the training corpus was limited. By standard perplexity it is not a usable Indonesian language model today.
## Architecture
| Property | Value |
|---|---|
| Parameters | 85.0M |
| Layers | 8 |
| Heads | 8 |
| Hidden size | 512 |
| FFN size | 2048 |
| Vocabulary | 50257 (GPT-2 BPE) |
| Context length | 256 |
| Norm / positions / activation | RMSNorm / RoPE / SwiGLU |
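
For orientation, the table maps onto a Llama-style stack (RMSNorm, RoPE, SwiGLU). A minimal sketch with transformers' `LlamaConfig` reproduces the ~85M parameter count, assuming an untied LM head; whether the checkpoint actually ships this config class is not confirmed here.

```python
# Sketch only: mapping the table onto LlamaConfig, which matches the
# stated RMSNorm + RoPE + SwiGLU stack. The config class is an assumption.
from transformers import LlamaConfig, LlamaForCausalLM

cfg = LlamaConfig(
    vocab_size=50257,             # GPT-2 BPE
    hidden_size=512,
    intermediate_size=2048,       # FFN size (SwiGLU)
    num_hidden_layers=8,
    num_attention_heads=8,
    max_position_embeddings=256,  # context length
    tie_word_embeddings=False,    # an untied LM head is needed to reach ~85M
)
model = LlamaForCausalLM(cfg)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
# ~85.0M = 2 x (50257 x 512) for embeddings + head, plus 8 layers x ~4.2M each
```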
## Measured baseline (Devin audit, CPU eval)
- Perplexity (50 ID sentences, GPT-2 tokenizer): 23154 (very high — model not converged)
- English-stopword ratio in ID-prompted output: 0.0%
- Indonesian-stopword ratio in ID-prompted output: 0.0%
For comparison, the working Indonesian models in this org reach perplexity ≈ 8–15 on the same 50-sentence eval set.
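
For reference, a minimal sketch of this kind of per-token perplexity eval. The repo id and the placeholder sentences below are assumptions; the actual 50-sentence eval set is not published in this card.

```python
# Minimal sketch of a weighted per-token perplexity eval.
# Assumptions: the repo id, and `sentences` as a stand-in for the
# 50-sentence Indonesian eval set.
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "AksaraLLM/Kiel-59M-Matured"  # assumed repo id
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo).eval()

sentences = ["Ibu kota Indonesia adalah Jakarta."]  # placeholder eval set

total_nll, total_tokens = 0.0, 0
with torch.no_grad():
    for s in sentences:
        enc = tok(s, return_tensors="pt")
        out = model(**enc, labels=enc["input_ids"])
        n = enc["input_ids"].size(1) - 1   # number of predicted tokens
        total_nll += out.loss.item() * n   # out.loss is mean NLL per token
        total_tokens += n

print(f"perplexity: {math.exp(total_nll / total_tokens):.1f}")
```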
Sample for the prompt "Indonesia adalah negara":

```text
Indonesia adalah negaraalum questionich4!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
```
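
A sketch to reproduce a sample like this with greedy decoding (same assumed repo id as above):

```python
# Sketch only; degenerate "!" runs like the sample above are expected.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "AksaraLLM/Kiel-59M-Matured"  # assumed repo id
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo).eval()

inputs = tok("Indonesia adalah negara", return_tensors="pt")
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=60,
        do_sample=False,                # greedy decoding
        pad_token_id=tok.eos_token_id,  # GPT-2 BPE has no pad token
    )
print(tok.decode(out[0], skip_special_tokens=True))
```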
## Why the previous "Skor 10/11 Grade S" is misleading
That figure is from a custom 11-question in-house scorecard, not from a standard LM evaluation. Perplexity on plain Indonesian text reveals that this checkpoint cannot model the distribution.
## Limitations
- Wrong tokenizer for the language: GPT-2 BPE is optimised for English (illustrated in the sketch after this list).
- Severely under-trained given the model size and the limited corpus.
- No chat template in tokenizer config; treat as a base LM only.
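
To make the tokenizer mismatch concrete, a small comparison of GPT-2 BPE token fertility on Indonesian versus English text. The example sentences here are illustrative only, not taken from the eval set.

```python
# Illustration: GPT-2 BPE splits ordinary Indonesian words into many
# sub-word pieces, while comparable English stays near 1 token/word.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

for text in [
    "Pemerintah mengumumkan kebijakan baru.",  # Indonesian
    "The government announced a new policy.",  # English
]:
    n_tokens = len(tok(text)["input_ids"])
    n_words = len(text.split())
    print(f"{n_tokens / n_words:.2f} tokens/word  <- {text}")
# Indonesian typically lands well above 2 tokens/word under GPT-2 BPE,
# which wastes both context length and model capacity.
```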
## What to use instead
- `AksaraLLM/Kiel-Pro-0.5B-v3`: 494M Qwen2-based, PPL ≈ 15.
- `AksaraLLM/AksaraLLM-Qwen-1.5B-v5-public`: 1.78B Qwen2-based, PPL ≈ 8.4.
## License
Apache 2.0