---
language:
- id
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- indonesian
- aksarallm
- archived
- research
---
# Kiel-Mini-59M-DPO
> ⚠️ **Status: early experiment.**
> This 85M-parameter decoder-only transformer was trained from scratch
> as part of the early AksaraLLM line. It uses the **GPT-2 BPE** tokenizer
> (50257-token vocab), which is not optimal for Indonesian, and the
> training corpus was limited. By standard perplexity it is **not** a usable
> Indonesian language model today.
## Architecture
| Property | Value |
|----------|-------|
| Parameters | 85.0M |
| Layers | 8 |
| Heads | 8 |
| Hidden size | 512 |
| FFN size | 2048 |
| Vocabulary | 50257 (GPT-2 BPE) |
| Context length | 128 |
| RMSNorm + RoPE + SwiGLU | yes |
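
As a sanity check, the table roughly accounts for the full 85.0M figure. This back-of-envelope sketch assumes untied input/output embeddings and no bias terms (both assumptions, not confirmed against the config):

```python
# Back-of-envelope parameter count from the architecture table above.
# Assumptions (not confirmed by the config): untied embeddings, no biases.
vocab, hidden, ffn, layers = 50257, 512, 2048, 8

embed   = vocab * hidden       # input embedding
lm_head = vocab * hidden       # output projection (assumed untied)
attn    = 4 * hidden * hidden  # Q, K, V, O projections
swiglu  = 3 * hidden * ffn     # gate, up, and down projections
norms   = 2 * hidden           # two RMSNorms per layer

total = embed + lm_head + layers * (attn + swiglu + norms) + hidden  # + final norm
print(f"{total / 1e6:.1f}M")  # -> 85.0M
```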
## Measured baseline (Devin audit, CPU eval)
- **Perplexity** (50 Indonesian sentences, GPT-2 tokenizer): 56,525 (very high; the model has not converged)
- **English-stopword ratio in Indonesian-prompted output**: 0.6%
- **Indonesian-stopword ratio in Indonesian-prompted output**: 0.0%
For comparison, the working Indonesian models in this org reach perplexity
≈ 8–15 on the same 50-sentence eval set.
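
A perplexity number like this can be reproduced on CPU with a minimal sketch along these lines. The repo id below is assumed from this card's name, and since the 50-sentence audit set is not published here, a stand-in sentence is used:

```python
# Minimal CPU perplexity sketch. The repo id and eval sentence are
# stand-ins; the actual 50-sentence audit set is not included here.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "AksaraLLM/Kiel-Mini-59M-DPO"  # assumed repo id
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo).eval()

sentences = ["Indonesia adalah negara kepulauan terbesar di dunia."]  # stand-in

total_nll, total_tokens = 0.0, 0
with torch.no_grad():
    for s in sentences:
        ids = tok(s, return_tensors="pt", truncation=True, max_length=128).input_ids
        # With labels == inputs, the returned loss is the mean next-token NLL.
        loss = model(ids, labels=ids).loss
        n = ids.size(1) - 1  # number of predicted tokens
        total_nll += loss.item() * n
        total_tokens += n

print("perplexity:", math.exp(total_nll / total_tokens))
```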
Sample completion for the prompt "Indonesia adalah negara":
```
Indonesia adalah negara coal covetedutterstock Citizensindependencealky mac motive <!-- Megan port Ruff togetDefinitionagamemarkets scars Contribut sort finances SharmaJoe [' quarterbacks698 admiredar
```
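
A sample like the one above, and the stopword ratios from the audit, can be approximated with the sketch below. The decoding settings and the stopword lists are illustrative stand-ins, not the audit's exact configuration:

```python
# Sketch: generate from an Indonesian prompt and measure stopword ratios.
# Decoding settings and stopword lists are illustrative, not the audit's.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "AksaraLLM/Kiel-Mini-59M-DPO"  # assumed repo id
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo).eval()

ids = tok("Indonesia adalah negara", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=40, do_sample=True, top_p=0.9,
                     pad_token_id=tok.eos_token_id)
text = tok.decode(out[0], skip_special_tokens=True)
print(text)

en_stop = {"the", "and", "of", "to", "in", "is", "that"}      # tiny stand-in list
id_stop = {"yang", "dan", "di", "dengan", "untuk", "adalah"}  # tiny stand-in list
words = text.lower().split()
print("EN stopword ratio:", sum(w in en_stop for w in words) / len(words))
print("ID stopword ratio:", sum(w in id_stop for w in words) / len(words))
```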
## Why the previous "Skor 10/11 Grade S" is misleading
That figure ("Score 10/11, Grade S") comes from a custom 11-question in-house
scorecard, not from a standard LM evaluation. Perplexity on plain Indonesian
text shows that this checkpoint cannot model the language's distribution.
## Limitations
- **Wrong tokenizer for the language**: GPT-2 BPE is optimised for English (see the sketch after this list).
- **Severely under-trained** for its size and training corpus.
- **No chat template** in tokenizer config; treat as a base LM only.
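
The tokenizer mismatch is easy to see directly: GPT-2 BPE splits common Indonesian words into many fragments, inflating sequence length and leaving each token with little meaning. The example text is arbitrary and the exact fragment counts depend on it:

```python
# Sketch: GPT-2 BPE over-fragments Indonesian text (illustrative example).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
text = "pembangunan berkelanjutan memerlukan kebijakan yang berpihak"
pieces = tok.tokenize(text)
print(len(text.split()), "words ->", len(pieces), "BPE pieces")
print(pieces)  # mostly short, meaning-free fragments
```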
## What to use instead
- [`AksaraLLM/Kiel-Pro-0.5B-v3`](https://huggingface.co/AksaraLLM/Kiel-Pro-0.5B-v3) — 494M Qwen2-based, PPL ≈ 15.
- [`AksaraLLM/AksaraLLM-Qwen-1.5B-v5-public`](https://huggingface.co/AksaraLLM/AksaraLLM-Qwen-1.5B-v5-public) — 1.78B Qwen2-based, PPL ≈ 8.4.
## License
Apache 2.0