Ezekiel999 committed
Commit e662de4 · verified · Parent: c374a49

[Devin Audit] update model card with measured baseline metrics + honest framing

See https://huggingface.co/datasets/AksaraLLM/audit-2026-04 (or the AUDIT_REPORT.md attached) for methodology and the full per-model eval results.

Files changed (1)
  1. README.md +56 -4

README.md CHANGED
@@ -1,16 +1,68 @@
  ---
  language:
  - id
  license: apache-2.0
  tags:
  - aksarallm
- - matured
  pipeline_tag: text-generation
  ---

  # Kiel-59M-Matured

- 59M params, matured with 50 SFT pairs.
- Skor: 11/11 (100%) Grade S

- **AksaraLLM Community**
  ---
  language:
  - id
+ - en
  license: apache-2.0
  tags:
  - aksarallm
+ - from-scratch
+ - indonesian
+ - early-experiment
  pipeline_tag: text-generation
  ---

  # Kiel-59M-Matured

+ > ⚠️ **Status: early experiment.**
+ > This 85M-parameter decoder-only transformer was trained from scratch
+ > as part of the early AksaraLLM line. It uses the **GPT-2 BPE** tokenizer
+ > (50257 vocab), which is not optimal for Indonesian, and the
+ > training corpus was limited. By standard perplexity it is **not** a usable
+ > Indonesian language model today.

+ ## Architecture
+
+ | Property | Value |
+ |----------|-------|
+ | Parameters | 85.0M |
+ | Layers | 8 |
+ | Heads | 8 |
+ | Hidden size | 512 |
+ | FFN size | 2048 |
+ | Vocabulary | 50257 (GPT-2 BPE) |
+ | Context length | 256 |
+ | RMSNorm + RoPE + SwiGLU | yes |
+
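The 85.0M figure is consistent with the table above if the input and output embeddings are untied and the FFN uses three SwiGLU projections. A quick arithmetic check (my sketch, not part of the audit; biases and norm weights are ignored as negligible):

```python
# Rough parameter count from the architecture table.
# Assumptions (mine, not stated in the audit): untied input/output
# embeddings, a three-matrix SwiGLU FFN, biases and norms ignored.
vocab, d_model, d_ffn, n_layers = 50257, 512, 2048, 8

embeddings = vocab * d_model        # input embedding matrix
lm_head    = vocab * d_model        # output projection (untied)
attn       = 4 * d_model * d_model  # Q, K, V, O projections
ffn        = 3 * d_model * d_ffn    # SwiGLU: gate, up, down
per_layer  = attn + ffn             # RoPE adds no parameters

total_untied = embeddings + lm_head + n_layers * per_layer
total_tied   = embeddings + n_layers * per_layer

print(f"untied: {total_untied / 1e6:.1f}M")  # 85.0M, matching the table
print(f"tied:   {total_tied / 1e6:.1f}M")    # 59.3M
```

Under these assumptions the tied-embedding count lands near 59M, which may be where the "59M" in the model name comes from.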
+ ## Measured baseline (Devin audit, CPU eval)
+
+ - **Perplexity** (50 ID sentences, GPT-2 tokenizer): 23154 (very high — model not converged)
+ - **English-stopword ratio in ID-prompted output**: 0.0%
+ - **Indonesian-stopword ratio in ID-prompted output**: 0.0%
+
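The stopword ratios can be reproduced with a simple token-overlap count. A minimal sketch, where the small stopword sets and the `stopword_ratio` helper are illustrative, not the audit's exact lists or harness:

```python
# Fraction of whitespace-split tokens that are known stopwords.
# The two stopword sets below are small illustrative samples,
# not the audit's full lists.
ID_STOPWORDS = {"yang", "dan", "di", "dari", "untuk", "adalah", "dengan", "ini", "itu"}
EN_STOPWORDS = {"the", "and", "of", "to", "is", "in", "for", "that", "with"}

def stopword_ratio(text: str, stopwords: set[str]) -> float:
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in stopwords for t in tokens) / len(tokens)

# Degenerate continuation similar to the sample below the metrics:
degenerate = "negaraalum questionich4 !!!!!!!!"
print(stopword_ratio(degenerate, ID_STOPWORDS))  # 0.0, no Indonesian stopwords
print(stopword_ratio(degenerate, EN_STOPWORDS))  # 0.0, no English ones either
```

A 0.0% ratio in *both* languages is what degenerate output looks like: the text belongs to neither language's distribution.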
+ For comparison, the working Indonesian models in this org reach perplexity
+ ≈ 8–15 on the same 50-sentence eval set.
+
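Perplexity is the exponential of the mean per-token negative log-likelihood, so 23154 versus ≈ 8–15 is an enormous gap on a log scale. A self-contained sketch of the formula with toy probabilities (mine, not the audit's eval harness):

```python
import math

def perplexity(token_probs: list[float]) -> float:
    """exp of the mean negative log-likelihood over the tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A converged model puts reasonable mass on each next token:
print(round(perplexity([0.1, 0.2, 0.05, 0.1])))  # 10, in the working models' range
# An unconverged model is near-uniform over its 50257-token vocab:
print(round(perplexity([1 / 50257] * 4)))        # 50257
```

Note that the measured 23154 sits far closer to the 50257 uniform ceiling than to the working models' 8–15.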
+ Sample for "Indonesia adalah negara":
+ ```
+ Indonesia adalah negaraalum questionich4!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
+ ```
+
+ ## Why the previous "Skor: 11/11 Grade S" is misleading
+
+ That figure comes from a custom 11-question in-house scorecard, not from a
+ standard LM evaluation. Perplexity on plain Indonesian text reveals that
+ this checkpoint cannot model the distribution.
+
+ ## Limitations
+
+ - **Wrong tokenizer for the language**: GPT-2 BPE is optimised for English.
+ - **Severely under-trained** for this model size and corpus.
+ - **No chat template** in the tokenizer config; treat as a base LM only.
+
+ ## What to use instead
+
+ - [`AksaraLLM/Kiel-Pro-0.5B-v3`](https://huggingface.co/AksaraLLM/Kiel-Pro-0.5B-v3) — 494M Qwen2-based, PPL ≈ 15.
+ - [`AksaraLLM/AksaraLLM-Qwen-1.5B-v5-public`](https://huggingface.co/AksaraLLM/AksaraLLM-Qwen-1.5B-v5-public) — 1.78B Qwen2-based, PPL ≈ 8.4.
+
+ ## License
+ Apache 2.0