---
language:
- id
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- indonesian
- aksarallm
- archived
- research
---
# Kiel-Mini-59M-DPO
> ⚠️ **Status: early experiment.**
> This 85M-parameter decoder-only transformer was trained from scratch
> as part of the early AksaraLLM line. It uses the **GPT-2 BPE** tokenizer
> (50257-token vocabulary), which is poorly suited to Indonesian, and the
> training corpus was limited. By standard perplexity it is **not** a usable
> Indonesian language model today.
## Architecture
| Property | Value |
|----------|-------|
| Parameters | 85.0M |
| Layers | 8 |
| Heads | 8 |
| Hidden size | 512 |
| FFN size | 2048 |
| Vocabulary | 50257 (GPT-2 BPE) |
| Context length | 128 |
| RMSNorm + RoPE + SwiGLU | yes |
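As a sanity check, the 85.0M figure can be reproduced from the table above. The sketch below assumes untied input/output embeddings (an assumption, not confirmed by the config); note that the non-embedding total lands near the "59M" in the model name.

```python
# Back-of-the-envelope parameter count from the architecture table.
# Assumption: untied input and output embeddings (not confirmed).
vocab, hidden, ffn, layers = 50257, 512, 2048, 8

embed = vocab * hidden                     # token embedding matrix
attn = 4 * hidden * hidden                 # Q, K, V, O projections
swiglu = 3 * hidden * ffn                  # gate, up, down projections
norms = 2 * hidden                         # two RMSNorm weights per block
block = attn + swiglu + norms

lm_head = vocab * hidden                   # untied output projection
total = embed + layers * block + hidden + lm_head  # + final RMSNorm

print(f"{total / 1e6:.1f}M parameters")                              # -> 85.0M
print(f"{(total - embed) / 1e6:.1f}M excluding the input embedding")  # -> 59.3M
```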
## Measured baseline (Devin audit, CPU eval)
- **Perplexity** (50 ID sentences, GPT-2 tokenizer): 56525 (very high; model not converged)
- **English-stopword ratio in ID-prompted output**: 0.6%
- **Indonesian-stopword ratio in ID-prompted output**: 0.0%
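The stopword ratios above are a simple lexical check. A minimal sketch of the idea (the stopword lists here are small illustrative samples, not the audit's actual lists):

```python
# Illustrative stopword lists -- NOT the exact lists used in the audit.
ID_STOPWORDS = {"yang", "dan", "di", "ini", "itu", "dengan", "untuk", "pada", "adalah", "dari"}
EN_STOPWORDS = {"the", "and", "of", "to", "in", "is", "that", "for", "it", "on"}

def stopword_ratio(text: str, stopwords: set[str]) -> float:
    """Fraction of whitespace-split words that appear in the stopword set."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in stopwords for w in words) / len(words)

sample = "Indonesia adalah negara coal covetedutterstock Citizens"
print(stopword_ratio(sample, ID_STOPWORDS))  # "adalah" is the only ID stopword -> 1/6
print(stopword_ratio(sample, EN_STOPWORDS))  # no English stopwords -> 0.0
```

A near-zero ratio for both languages, as measured here, indicates the output is mostly non-word token salad rather than text in either language.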
For comparison, the working Indonesian models in this org reach perplexity
≈ 8–15 on the same 50-sentence eval set.
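For scale: a model that assigned uniform probability over the 50257-token GPT-2 vocabulary would score a perplexity of exactly 50257, so the measured 56525 is worse than a uniform distribution on this eval set. A minimal sketch of the corpus-level perplexity formula used in such evals:

```python
import math

def corpus_perplexity(nll_sums: list[float], token_counts: list[int]) -> float:
    """Corpus-level perplexity: exp(total negative log-likelihood / total tokens)."""
    return math.exp(sum(nll_sums) / sum(token_counts))

# A uniform model assigns every token NLL = ln(vocab_size), so its
# perplexity equals the vocabulary size.
n_tokens = 1000
uniform_ppl = corpus_perplexity([math.log(50257) * n_tokens], [n_tokens])
print(uniform_ppl)  # ~= 50257 (the vocabulary size)
```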
Sample for "Indonesia adalah negara":
```
Indonesia adalah negara coal covetedutterstock Citizensindependencealky mac motive <!-- Megan port Ruff togetDefinitionagamemarkets scars Contribut sort finances SharmaJoe [' quarterbacks698 admiredar
```
## Why the previous "Skor 10/11 Grade S" is misleading
That figure comes from a custom 11-question in-house scorecard, not from a
standard LM evaluation. Perplexity on plain Indonesian text shows that
this checkpoint does not model the language's token distribution.
## Limitations
- **Wrong tokenizer for the language**: GPT-2 BPE is optimised for English, so Indonesian words fragment into many subword pieces.
- **Severely under-trained** for this parameter count, given the limited corpus.
- **No chat template** in tokenizer config; treat as a base LM only.
## What to use instead
- [`AksaraLLM/Kiel-Pro-0.5B-v3`](https://huggingface.co/AksaraLLM/Kiel-Pro-0.5B-v3) – 494M Qwen2-based, PPL ≈ 15.
- [`AksaraLLM/AksaraLLM-Qwen-1.5B-v5-public`](https://huggingface.co/AksaraLLM/AksaraLLM-Qwen-1.5B-v5-public) – 1.78B Qwen2-based, PPL ≈ 8.4.
## License
Apache 2.0