---
language:
- id
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- indonesian
- aksarallm
- archived
- research
---
# Kiel-Mini-59M-DPO

> ⚠️ **Status: early experiment.**
> This 85M-parameter decoder-only transformer was trained from scratch
> as part of the early AksaraLLM line. It uses the **GPT-2 BPE** tokenizer
> (50257 vocab) which is not optimal for Indonesian, and the
> training corpus was limited. By standard perplexity it is **not** a usable
> Indonesian language model today.

## Architecture

| Property | Value |
|----------|-------|
| Parameters | 85.0M |
| Layers | 8 |
| Heads | 8 |
| Hidden size | 512 |
| FFN size | 2048 |
| Vocabulary | 50257 (GPT-2 BPE) |
| Context length | 128 |
| RMSNorm + RoPE + SwiGLU | yes |
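
The 85.0M in the table versus the 59M in the model name is consistent with whether the output head is counted separately from the input embedding. A back-of-envelope parameter count, as a sketch only (assuming untied input/output embeddings, no biases, SwiGLU as three projection matrices, and two RMSNorms per block):

```python
vocab, d, ffn, layers = 50257, 512, 2048, 8

embed = vocab * d            # input token embedding matrix
attn = 4 * d * d             # Q, K, V, O projections (no biases)
swiglu = 3 * d * ffn         # gate, up, and down projections
norms = 2 * d                # two RMSNorm weight vectors per block
block = attn + swiglu + norms

tied = embed + layers * block + d   # output head shares the embedding
untied = tied + vocab * d           # separate lm_head matrix

print(f"tied ~ {tied/1e6:.1f}M, untied ~ {untied/1e6:.1f}M")
```

Under these assumptions the tied count lands near 59M and the untied count near 85M, which would reconcile the name with the table.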

## Measured baseline (Devin audit, CPU eval)

- **Perplexity** (50 ID sentences, GPT-2 tokenizer): 56525 (very high — model not converged)
- **English-stopword ratio in ID-prompted output**: 0.6%
- **Indonesian-stopword ratio in ID-prompted output**: 0.0%
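
The stopword-ratio metrics above amount to a simple token-overlap count. A minimal sketch, with illustrative stopword sets that are assumptions rather than the audit's exact lists:

```python
# Illustrative stopword sets (assumptions, not the audit's exact lists).
EN_STOPWORDS = {"the", "and", "of", "to", "in", "is", "that", "it"}
ID_STOPWORDS = {"yang", "dan", "di", "ke", "dari", "untuk", "adalah", "dengan"}

def stopword_ratio(text: str, stopwords: set) -> float:
    """Fraction of whitespace-split tokens that appear in the stopword set."""
    tokens = text.lower().split()
    return sum(t in stopwords for t in tokens) / len(tokens) if tokens else 0.0
```

Fluent Indonesian output should score well above zero against an Indonesian stopword list; this checkpoint's 0.0% indicates its output is not Indonesian at all.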

For comparison, the working Indonesian models in this org reach perplexity
≈ 8–15 on the same 50-sentence eval set.
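
These perplexity figures are exponentials of the mean per-token negative log-likelihood, so 56525 versus 8–15 is a difference of several nats per token. A minimal sketch of that relationship, independent of any particular model:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-probability per token)."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model assigning probability 1/2 to every token has perplexity ~ 2.
print(perplexity([math.log(0.5)] * 10))
```

In a full evaluation, the log-probabilities would come from the model's logits over the 50-sentence set rather than a constant.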

Sample output for the prompt "Indonesia adalah negara" ("Indonesia is a country"):
```
Indonesia adalah negara coal covetedutterstock Citizensindependencealky mac motive <!-- Megan port Ruff togetDefinitionagamemarkets scars Contribut sort finances SharmaJoe [' quarterbacks698 admiredar
```

## Why the previous "Score 10/11 Grade S" is misleading

That figure is from a custom 11-question in-house scorecard, not from a
standard LM evaluation. Perplexity on plain Indonesian text reveals that
this checkpoint cannot model the distribution.

## Limitations

- **Wrong tokenizer for the language**: GPT-2 BPE is optimised for English.
- **Severely under-trained** at this size + corpus.
- **No chat template** in tokenizer config; treat as a base LM only.

## What to use instead

- [`AksaraLLM/Kiel-Pro-0.5B-v3`](https://huggingface.co/AksaraLLM/Kiel-Pro-0.5B-v3) — 494M Qwen2-based, PPL ≈ 15.
- [`AksaraLLM/AksaraLLM-Qwen-1.5B-v5-public`](https://huggingface.co/AksaraLLM/AksaraLLM-Qwen-1.5B-v5-public) — 1.78B Qwen2-based, PPL ≈ 8.4.

## License

Apache 2.0