[Devin Audit] add HF YAML front-matter (language, license, base_model, tags) for discoverability
847a962 (verified)

---
language:
- id
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- indonesian
- aksarallm
- archived
- research
---
# Kiel-59M-Matured
> ⚠️ **Status: early experiment.**
> This 85M-parameter decoder-only transformer was trained from scratch
> as part of the early AksaraLLM line. It uses the **GPT-2 BPE** tokenizer
> (50257-token vocabulary), which is not optimal for Indonesian, and the
> training corpus was limited. By standard perplexity it is **not** a usable
> Indonesian language model today.
## Architecture
| Property | Value |
|----------|-------|
| Parameters | 85.0M |
| Layers | 8 |
| Attention heads | 8 |
| Hidden size | 512 |
| FFN size | 2048 |
| Vocabulary | 50257 (GPT-2 BPE) |
| Context length | 256 |
| Norm / positions / activation | RMSNorm / RoPE / SwiGLU |
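As a sanity check, the parameter count follows directly from the table. A minimal sketch, assuming untied input/output embeddings and a three-matrix (gate/up/down) SwiGLU FFN:

```python
# Back-of-the-envelope parameter count from the table above.
# Assumes untied embeddings and a gate/up/down SwiGLU FFN; biases
# and RMSNorm scales (negligible at this size) are ignored.
vocab, hidden, ffn, layers = 50257, 512, 2048, 8

embeddings = 2 * vocab * hidden      # input + output matrices: ~51.5M
attention  = 4 * hidden * hidden     # q, k, v, o projections per layer
swiglu     = 3 * hidden * ffn        # gate, up, down per layer

total = embeddings + layers * (attention + swiglu)
print(f"{total / 1e6:.1f}M")  # -> 85.0M, matching the table
```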
## Measured baseline (Devin audit, CPU eval)
- **Perplexity** (50 Indonesian sentences, GPT-2 tokenizer): 23154 (very high; the model has not converged)
- **English-stopword ratio in Indonesian-prompted output**: 0.0%
- **Indonesian-stopword ratio in Indonesian-prompted output**: 0.0%

For comparison, the working Indonesian models in this org reach perplexity
≈ 8–15 on the same 50-sentence eval set.
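A minimal sketch of how such a perplexity figure can be reproduced with `transformers`; the repo id and the eval sentences here are assumptions standing in for the audit's actual inputs:

```python
# Token-weighted perplexity over an Indonesian sentence list.
# MODEL_ID and the sentences are placeholders, not the audit's exact setup.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "AksaraLLM/Kiel-59M-Matured"  # assumed repo id
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID).eval()

sentences = [
    "Pemerintah mengumumkan kebijakan ekonomi baru hari ini.",
    # ... the remaining eval sentences
]

nll, n_tokens = 0.0, 0
with torch.no_grad():
    for s in sentences:
        ids = tok(s, return_tensors="pt").input_ids
        loss = model(input_ids=ids, labels=ids).loss  # mean cross-entropy
        n = ids.size(1) - 1                           # predicted positions
        nll += loss.item() * n
        n_tokens += n

print(f"perplexity = {math.exp(nll / n_tokens):.0f}")
```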
Sample completion for the prompt "Indonesia adalah negara":
```
Indonesia adalah negaraalum questionich4!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
```
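The sample and the stopword ratios above can be reproduced along these lines (a sketch; the repo id, decoding settings, and the tiny stopword lists are illustrative assumptions):

```python
# Greedy completion plus crude stopword ratios on the decoded output.
# Repo id, decoding settings, and stopword lists are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "AksaraLLM/Kiel-59M-Matured"  # assumed repo id
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

ids = tok("Indonesia adalah negara", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=50, do_sample=False)
text = tok.decode(out[0], skip_special_tokens=True)
print(text)

EN = {"the", "of", "and", "to", "in", "is", "a"}    # toy English list
ID = {"yang", "dan", "di", "ini", "itu", "dengan"}  # toy Indonesian list
words = text.lower().split()
print(f"EN stopwords: {sum(w in EN for w in words) / len(words):.1%}")
print(f"ID stopwords: {sum(w in ID for w in words) / len(words):.1%}")
```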
## Why the previous "Skor 10/11 Grade S" is misleading
That figure comes from a custom 11-question in-house scorecard, not from a
standard LM evaluation. Perplexity on plain Indonesian text shows that this
checkpoint cannot model the language's distribution.
## Limitations
- **Wrong tokenizer for the language**: GPT-2 BPE is optimised for English; see the fertility sketch after this list.
- **Severely under-trained** given its size and corpus.
- **No chat template** in the tokenizer config; treat it as a base LM only.
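A quick way to see the tokenizer mismatch is token fertility (tokens per word): the stock GPT-2 tokenizer splits Indonesian words into far more pieces than English ones. A minimal sketch with illustrative sentences:

```python
# Compare GPT-2 BPE fertility (tokens per word) on English vs Indonesian.
# Higher fertility means a worse fit between tokenizer and language.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
samples = [
    "The government announced a new economic policy today.",
    "Pemerintah mengumumkan kebijakan ekonomi baru hari ini.",
]
for text in samples:
    fertility = len(tok(text).input_ids) / len(text.split())
    print(f"{fertility:.2f} tokens/word  <- {text}")
```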
## What to use instead
- [`AksaraLLM/Kiel-Pro-0.5B-v3`](https://huggingface.co/AksaraLLM/Kiel-Pro-0.5B-v3): 494M Qwen2-based, PPL ≈ 15.
- [`AksaraLLM/AksaraLLM-Qwen-1.5B-v5-public`](https://huggingface.co/AksaraLLM/AksaraLLM-Qwen-1.5B-v5-public): 1.78B Qwen2-based, PPL ≈ 8.4.
## License
Apache 2.0