---
language:
- id
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- indonesian
- aksarallm
- archived
- research
---

# Kiel-200M-Matured

> ⚠️ **Status: research artifact / early experiment, not a usable language model today.**
> The tokenizer used at training time was not preserved in this repository,
> so the checkpoint cannot be loaded end-to-end with HF `AutoModel` /
> `AutoTokenizer`. Output quality on standard Indonesian benchmarks is far
> below the org's working models (`Kiel-Pro-0.5B-v3`, `AksaraLLM-Qwen-1.5B-v5-public`).

## What this is

A **271M-parameter** decoder-only transformer trained from scratch as part of the early AksaraLLM experiments.

Architecture (inferred from weight shapes):

| Property | Value |
|----------|-------|
| Parameters | 271.1M |
| Layers | 16 |
| Heads | 16 |
| Hidden size | 1024 |
| FFN size (SwiGLU) | 2816 |
| Vocabulary | 32000 |
| Context length | 256 |
| RMSNorm + RoPE + SwiGLU | yes |

## Measured baseline (Devin audit, CPU eval)

We loaded this checkpoint with `aksarallm.model.aksaraLLMModel`, tested several candidate tokenizers (`AksaraLLM/aksara-tokenizer-v1/v2/v3`, Llama-2 SentencePiece, GPT-2 BPE), and ran perplexity on 50 short Indonesian Wikipedia-style sentences plus 5 free-form prompts. Best-case results:

- **Perplexity**: BROKEN (tokenizer mismatch — see Limitations), meaning the model is **not** modelling the Indonesian text distribution.
- **English-stopword ratio in ID-prompted output**: 0.0%
- **Indonesian-stopword ratio in ID-prompted output**: 0.0%

Sample output for prompt "Indonesia adalah negara":

```
Indonesia adalah negaraate companate compan bet cop4 config G betentent somet compan4ident L coll2 Y great75 from less4 Gil differe�ident L fun speech2 Yost immishalhaps4 eas ind we Qu immis
```

## Why the original "Skor 10/11 (Grade S)" claim is misleading

The score reported in earlier versions of this README comes from a custom 11-question in-house scorecard graded on a tiny SFT set, not from a standard language-modelling metric. By any standard LM evaluation (perplexity, response coherence on out-of-distribution prompts, identity calibration), this model does not function.

## Limitations

- **Tokenizer not preserved.** Without it, downstream usage is impossible.
- **No HF-compatible `config.json`** in the original training pipeline; the `aksarallm` package is required for loading.
- **Vocab size 32000** — does not match any published AksaraLLM tokenizer (32,768 or 64,000). It does equal Llama-2's 32,000, but the Llama-2 SentencePiece tokenizer still produced garbage in the audit above, so it is not the original tokenizer either.
- **Trained on a small mixed corpus** (Wikipedia / SFT pairs), insufficient for general Indonesian generation at this scale.

## What to use instead

If you want a small Indonesian LM in the AksaraLLM family, use:

- [`AksaraLLM/Kiel-Pro-0.5B-v3`](https://huggingface.co/AksaraLLM/Kiel-Pro-0.5B-v3) — 494M Qwen2-based, perplexity ≈ 15 on the same eval set.
- [`AksaraLLM/AksaraLLM-Qwen-1.5B-v5-public`](https://huggingface.co/AksaraLLM/AksaraLLM-Qwen-1.5B-v5-public) — 1.78B Qwen2-based, perplexity ≈ 8.4.

## Citation

```
@misc{aksarallm-kiel-200m-matured,
  author       = {Cahyok Putra and AksaraLLM Community},
  title        = {Kiel-200M-Matured: early-experiment Indonesian transformer (271M params)},
  year         = 2025,
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/AksaraLLM/Kiel-200M-Matured}},
}
```

## License

Apache 2.0
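The architecture table above was inferred from raw weight shapes, since no `config.json` survives. A minimal sketch of that approach, using a mocked state dict of plain shape tuples; the Llama-style key names here are hypothetical and may not match the real checkpoint's names:

```python
# Sketch: inferring architecture hyperparameters from weight shapes alone.
# The state dict is mocked with plain shape tuples; key names follow a
# hypothetical Llama-style layout, not necessarily the real checkpoint's.
shapes = {
    "embed_tokens.weight": (32000, 1024),             # vocab_size x hidden_size
    "layers.0.self_attn.q_proj.weight": (1024, 1024),  # hidden_size x hidden_size
    "layers.0.mlp.gate_proj.weight": (2816, 1024),     # ffn_size x hidden_size
    "layers.15.mlp.gate_proj.weight": (2816, 1024),    # highest layer index seen
}

vocab_size, hidden_size = shapes["embed_tokens.weight"]
ffn_size = shapes["layers.0.mlp.gate_proj.weight"][0]
num_layers = 1 + max(
    int(name.split(".")[1]) for name in shapes if name.startswith("layers.")
)

print(vocab_size, hidden_size, ffn_size, num_layers)  # 32000 1024 2816 16
```

The same scan over a real `state_dict` (replacing the tuples with `tensor.shape`) yields the table's values.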
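The stopword-ratio figures in the audit section can be computed along these lines. This is a minimal sketch: the stopword lists below are small illustrative samples, not the exact lists used during evaluation.

```python
# Sketch of the stopword-ratio metric from the audit above. The stopword
# lists are small illustrative samples, not the audit's exact lists.
ID_STOPWORDS = {"yang", "dan", "di", "dari", "adalah", "untuk", "dengan", "pada"}
EN_STOPWORDS = {"the", "and", "of", "to", "is", "from", "with", "for"}

def stopword_ratio(text: str, stopwords: set[str]) -> float:
    """Fraction of whitespace-separated tokens (lowercased, punctuation
    stripped) that appear in the given stopword set."""
    tokens = [t.strip(".,!?\"'()").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    return sum(t in stopwords for t in tokens) / len(tokens)

# A gibberish continuation like the sample above scores ~0% on both lists.
continuation = "ate companate compan bet cop4 config G betentent somet"
print(f"EN: {stopword_ratio(continuation, EN_STOPWORDS):.1%}")  # EN: 0.0%
print(f"ID: {stopword_ratio(continuation, ID_STOPWORDS):.1%}")  # ID: 0.0%
```

A 0.0% ratio on both lists indicates the output is neither Indonesian nor English function-word text, consistent with the broken-perplexity finding.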