---
language:
  - id
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
  - indonesian
  - aksarallm
  - archived
  - research
---
# Kiel-200M-Matured

> ⚠️ **Status: research artifact / early experiment, not a usable language model today.**
> The tokenizer used at training time was not preserved in this repository,
> so the checkpoint cannot be loaded end-to-end with HF `AutoModel` /
> `AutoTokenizer`. Output quality on standard Indonesian benchmarks is far
> below the org's working models (`Kiel-Pro-0.5B-v3`, `AksaraLLM-Qwen-1.5B-v5-public`).
## What this is

A **271M-parameter** decoder-only transformer trained from scratch as
part of the early AksaraLLM experiments. Architecture (inferred from weight
shapes):
| Property | Value |
|----------|-------|
| Parameters | 271.1M |
| Layers | 16 |
| Heads | 16 |
| Hidden size | 1024 |
| FFN size (SwiGLU) | 2816 |
| Vocabulary | 32 000 |
| Context length | 256 |
| RMSNorm + RoPE + SwiGLU | yes |
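
The 271.1M figure is consistent with these shapes. Below is a minimal sketch of the arithmetic, assuming a Llama-style layout with untied input/output embeddings, bias-free linear layers, and two RMSNorm weight vectors per block; these layout details are inferred from the weight shapes, not confirmed by the original training code.

```python
# Back-of-the-envelope parameter count from the table above.
# Assumes untied embeddings and bias-free projections (an inference, not a
# documented fact about the training run).
vocab, d_model, n_layers, d_ffn = 32_000, 1_024, 16, 2_816

embedding  = vocab * d_model          # input token embedding
attention  = 4 * d_model * d_model    # q, k, v, o projections
swiglu_ffn = 3 * d_model * d_ffn      # gate, up, down projections
norms      = 2 * d_model              # two RMSNorm vectors per block
per_block  = attention + swiglu_ffn + norms

final_norm = d_model
lm_head    = vocab * d_model          # untied output projection

total = embedding + n_layers * per_block + final_norm + lm_head
print(f"{total / 1e6:.1f}M")          # -> 271.1M
```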
## Measured baseline (Devin audit, CPU eval)
We loaded this checkpoint with `aksarallm.model.aksaraLLMModel`, tested several
candidate tokenizers (`AksaraLLM/aksara-tokenizer-v1/v2/v3`, Llama-2 SentencePiece,
GPT-2 BPE), and ran perplexity on 50 short Indonesian Wikipedia-style sentences
plus 5 free-form prompts. Best-case results (the measurement loop is sketched after this list):
- **Perplexity**: BROKEN (tokenizer mismatch; see Limitations), meaning the model is **not** modelling the Indonesian distribution.
- **English-stopword ratio in ID-prompted output**: 0.0%
- **Indonesian-stopword ratio in ID-prompted output**: 0.0%
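
The perplexity number follows the standard next-token cross-entropy recipe. A hedged sketch, assuming the loaded model returns an HF-style output with `.logits` and that a candidate HF tokenizer is plugged in; the audit's actual loading code and interfaces may differ.

```python
import math
import torch

def corpus_perplexity(model, tokenizer, sentences, device="cpu"):
    """Average next-token perplexity over a list of sentences.

    `model` and `tokenizer` are placeholders for whatever the audit loaded
    (aksarallm.model.aksaraLLMModel plus a candidate HF tokenizer); this is
    a generic sketch, not the repository's own evaluation code.
    """
    model.eval()
    total_nll, total_tokens = 0.0, 0
    with torch.no_grad():
        for text in sentences:
            ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
            if ids.size(1) < 2:
                continue
            logits = model(ids).logits  # [1, seq_len, vocab]
            # Predict token t+1 from position t, summing the negative log-likelihood.
            loss = torch.nn.functional.cross_entropy(
                logits[:, :-1].reshape(-1, logits.size(-1)),
                ids[:, 1:].reshape(-1),
                reduction="sum",
            )
            total_nll += loss.item()
            total_tokens += ids.size(1) - 1
    return math.exp(total_nll / total_tokens)
```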
Sample output for prompt "Indonesia adalah negara":

```
Indonesia adalah negaraate companate compan bet cop4 config G betentent somet compan4ident L coll2 Y great75 from less4 Gil differe�ident L fun speech2 Yost immishalhaps4 eas ind we Qu immis
```
## Why the original "Skor 10/11 (Grade S)" claim is misleading

The score in earlier versions of this README comes from a custom 11-question
in-house scorecard graded on a tiny SFT set, not from a standard language
modelling metric. By any standard LM evaluation (perplexity, response
coherence on out-of-distribution prompts, identity calibration), this model
does not function.
## Limitations

- **Tokenizer not preserved.** Without it, downstream usage is impossible.
- **No HF-compatible `config.json`** in the original training pipeline; the
  `aksarallm` package is required for loading (a hypothetical mapping is sketched after this list).
- **Vocab size 32 000** matches Llama-2's tokenizer but none of the published
  AksaraLLM tokenizers (32 768, 64 000), and no candidate tokenizer tested in
  the audit reproduced the training tokenization.
- **Trained on a small mixed corpus** (Wikipedia / SFT pairs), insufficient
  for general Indonesian generation at this scale.
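
For completeness, here is a hypothetical LlamaConfig-style mapping of the inferred shapes, in case someone wants to re-export the checkpoint for `transformers`. No such `config.json` ships with this repository; `model_type: llama` and every omitted field (RoPE theta, norm epsilon, embedding tying) are assumptions, not facts about the training run.

```python
# Hypothetical config sketch; values are taken from the architecture table above,
# everything else about the export path is an assumption.
inferred_config = {
    "model_type": "llama",            # RMSNorm + RoPE + SwiGLU is Llama-style
    "vocab_size": 32_000,
    "hidden_size": 1_024,
    "intermediate_size": 2_816,
    "num_hidden_layers": 16,
    "num_attention_heads": 16,
    "max_position_embeddings": 256,
}
```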
## What to use instead

If you want a small Indonesian LM in the AksaraLLM family, use:

- [`AksaraLLM/Kiel-Pro-0.5B-v3`](https://huggingface.co/AksaraLLM/Kiel-Pro-0.5B-v3): 494M Qwen2-based, perplexity ≈ 15 on the same eval set.
- [`AksaraLLM/AksaraLLM-Qwen-1.5B-v5-public`](https://huggingface.co/AksaraLLM/AksaraLLM-Qwen-1.5B-v5-public): 1.78B Qwen2-based, perplexity ≈ 8.4.
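
A minimal loading sketch, assuming those repos ship standard HF-format weights and tokenizer files (typical for Qwen2-based checkpoints); generation arguments are illustrative only.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load one of the working models instead of this checkpoint.
repo_id = "AksaraLLM/Kiel-Pro-0.5B-v3"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("Indonesia adalah negara", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```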
## Citation

```
@misc{aksarallm-kiel-200m-matured,
  author       = {Cahyok Putra and AksaraLLM Community},
  title        = {Kiel-200M-Matured: early-experiment Indonesian transformer (271M params)},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/AksaraLLM/Kiel-200M-Matured}},
}
```
## License

Apache 2.0