[Devin Audit] update model card with measured baseline metrics + honest framing

See https://huggingface.co/datasets/AksaraLLM/audit-2026-04 (or the AUDIT_REPORT.md attached) for methodology and the full per-model eval results.


README.md +83 -4


@@ -4,13 +4,92 @@ language:
 license: apache-2.0
 tags:
 - aksarallm
-- matured
 pipeline_tag: text-generation
 ---
 # Kiel-200M-Matured
-238M params, matured dengan 50 SFT pairs.
-Skor: 9/11 (82%) — Grade A
-**AksaraLLM Community**

 license: apache-2.0
 tags:
 - aksarallm
+- from-scratch
+- indonesian
+- early-experiment
+- research-artifact
 pipeline_tag: text-generation
 ---
 # Kiel-200M-Matured
+> ⚠️ **Status: research artifact / early experiment, not a usable language model today.**
+> The tokenizer used at training time was not preserved in this repository,
+> so the checkpoint cannot be loaded end-to-end with HF `AutoModel` /
+> `AutoTokenizer`. Output quality on standard Indonesian benchmarks is far
+> below the org's working models (`Kiel-Pro-0.5B-v3`, `AksaraLLM-Qwen-1.5B-v5-public`).
+## What this is
+A **271M-parameter** decoder-only transformer trained from scratch as
+part of the early AksaraLLM experiments. Architecture (inferred from weight
+shapes):
+| Property | Value |
+|----------|-------|
+| Parameters | 271.1M |
+| Layers | 16 |
+| Heads | 16 |
+| Hidden size | 1024 |
+| FFN size (SwiGLU) | 2816 |
+| Vocabulary | 32000 |
+| Context length | 256 |
+| RMSNorm + RoPE + SwiGLU | yes |
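The 271.1M figure can be sanity-checked from the table alone. The sketch below is only a rough count under assumptions that the repository does not confirm: bias-free linear layers, standard multi-head attention with four hidden-by-hidden projections, two RMSNorm weight vectors per layer plus a final one, and an output head that is not tied to the input embedding. The function name `count_params` is illustrative.

```python
def count_params(layers, hidden, ffn, vocab, tied_head):
    """Rough parameter count for a decoder-only transformer (assumed layout)."""
    embedding = vocab * hidden            # input token embedding
    attention = 4 * hidden * hidden       # Q, K, V and output projections, no biases
    swiglu = 3 * hidden * ffn             # gate, up and down projections
    norms = 2 * hidden                    # two RMSNorm weight vectors per layer
    per_layer = attention + swiglu + norms
    head = 0 if tied_head else vocab * hidden  # separate output projection if untied
    final_norm = hidden
    return embedding + layers * per_layer + final_norm + head


untied = count_params(16, 1024, 2816, 32000, tied_head=False)
tied = count_params(16, 1024, 2816, 32000, tied_head=True)
print(f"untied head: {untied / 1e6:.1f}M")  # 271.1M
print(f"tied head:   {tied / 1e6:.1f}M")    # 238.3M
```

Under these assumptions the untied count reproduces the 271.1M in the table, while a tied head would give roughly 238M, which is the figure the previous README quoted.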
+## Measured baseline (Devin audit, CPU eval)
+We loaded this checkpoint with `aksarallm.model.aksaraLLMModel`, tested several
+candidate tokenizers (`AksaraLLM/aksara-tokenizer-v1/v2/v3`, Llama-2 SentencePiece,
+GPT-2 BPE), and ran perplexity on 50 short Indonesian Wikipedia-style sentences
+plus 5 free-form prompts. Best-case results:
+- **Perplexity**: not measurable (tokenizer mismatch, see Limitations). No candidate tokenizer yielded a meaningful value, so there is no evidence that the model captures the Indonesian token distribution.
+- **English-stopword ratio in ID-prompted output**: 0.0%
+- **Indonesian-stopword ratio in ID-prompted output**: 0.0%
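For readers unfamiliar with the metric, perplexity is conventionally the exponential of the mean negative log-likelihood per token. The sketch below shows only that definition; it is not the audit's evaluation script, and the function name `perplexity` and the uniform-guess reference point are illustrative assumptions.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) over tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Reference point: a model that guesses uniformly over a 32000-entry
# vocabulary assigns log(1/32000) to every token.
uniform_ppl = perplexity([math.log(1 / 32000)] * 10)
print(round(uniform_ppl))  # 32000
```

The value only means something when the token IDs fed to the model are the ones it was trained on, which is why a tokenizer mismatch invalidates the measurement.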
+Sample output for prompt "Indonesia adalah negara":
+```
+Indonesia adalah negaraate companate compan bet cop4 config G betentent somet compan4ident L coll2 Y great75 from less4 Gil differe�ident L fun speech2 Yost immishalhaps4 eas ind we Qu immis
+```
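A stopword ratio of the kind reported above can be computed as the share of word tokens that appear in a stopword list. This is a minimal sketch, not the audit's code: the two stopword sets are deliberately tiny and hypothetical, the regex tokenisation is an assumption, and the example sentence is hand-written fluent Indonesian rather than model output.

```python
import re

# Hypothetical, deliberately small lists; the audit's actual lists are not given.
ID_STOPWORDS = {"yang", "dan", "di", "dari", "ini", "itu", "untuk", "dengan", "adalah"}
EN_STOPWORDS = {"the", "and", "of", "to", "in", "is", "that", "for", "with"}


def stopword_ratio(text, stopwords):
    """Fraction of lowercase alphabetic tokens that are in `stopwords`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in stopwords for token in tokens) / len(tokens)


# Hand-written reference sentence ("Indonesia is a large and beautiful country").
reference = "Indonesia adalah negara yang besar dan indah"
id_ratio = stopword_ratio(reference, ID_STOPWORDS)  # 3 of 7 tokens
en_ratio = stopword_ratio(reference, EN_STOPWORDS)  # no English stopwords
```

Fluent Indonesian text is expected to score well above zero on the Indonesian list, so a ratio near zero on generated text suggests the output is not ordinary Indonesian prose.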
+## Why the original "Skor: 9/11 (Grade A)" claim is misleading
+The score in earlier versions of this README is from a custom 11-question
+in-house scorecard graded on a tiny SFT set, not from a standard language
+modelling metric. By any standard LM evaluation (perplexity, response
+coherence on out-of-distribution prompts, identity calibration), this model
+does not function.
+## Limitations
+- **Tokenizer not preserved.** Without it, downstream usage is impossible.
+- **No HF-compatible `config.json`** in the original training pipeline; the
+  `aksarallm` package is required for loading.
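+- **Vocab size 32000** does not match any published AksaraLLM
+  tokenizer (32 768, 64 000). It equals the Llama-2 vocabulary size (32 000),
+  but an equal size does not imply the same token-to-ID mapping, and the
+  Llama-2 tokenizer was among the candidates that failed in the audit.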
+- **Trained on a small mixed corpus** (Wikipedia / SFT pairs), insufficient
+  for general Indonesian generation at this scale.
+## What to use instead
+If you want a small Indonesian LM in the AksaraLLM family, use:
+- [`AksaraLLM/Kiel-Pro-0.5B-v3`](https://huggingface.co/AksaraLLM/Kiel-Pro-0.5B-v3) — 494M Qwen2-based, perplexity ≈ 15 on the same eval set.
+- [`AksaraLLM/AksaraLLM-Qwen-1.5B-v5-public`](https://huggingface.co/AksaraLLM/AksaraLLM-Qwen-1.5B-v5-public) — 1.78B Qwen2-based, perplexity ≈ 8.4.
+## Citation
+```
+@misc{aksarallm-kiel-200m-matured,
+  author       = {Cahyok Putra and AksaraLLM Community},
+  title        = {Kiel-200M-Matured: early-experiment Indonesian transformer (271M params)},
+  year         = 2025,
+  publisher    = {Hugging Face},
+  howpublished = {\url{https://huggingface.co/AksaraLLM/Kiel-200M-Matured}},
+}
+```
+## License
+Apache 2.0