---
language:
- id
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- indonesian
- aksarallm
- archived
- research
---
# Kiel-200M-Matured
> ⚠️ **Status: research artifact / early experiment, not a usable language model today.**
> The tokenizer used at training time was not preserved in this repository,
> so the checkpoint cannot be loaded end-to-end with HF `AutoModel` /
> `AutoTokenizer`. Output quality on standard Indonesian benchmarks is far
> below the org's working models (`Kiel-Pro-0.5B-v3`, `AksaraLLM-Qwen-1.5B-v5-public`).
## What this is
A **271M-parameter** decoder-only transformer trained from scratch as
part of the early AksaraLLM experiments. Architecture (inferred from weight
shapes):
| Property | Value |
|----------|-------|
| Parameters | 271.1M |
| Layers | 16 |
| Heads | 16 |
| Hidden size | 1024 |
| FFN size (SwiGLU) | 2816 |
| Vocabulary | 32000 |
| Context length | 256 |
| Norm / positions / activation | RMSNorm / RoPE / SwiGLU |
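
As a sanity check, the 271.1M figure can be reproduced from the shapes in the table. The sketch below assumes a standard Llama-style layout (no bias terms, untied input/output embeddings); the actual tensor names in the checkpoint may differ.

```python
# Reproduce the ~271M parameter count from the table above.
# Assumptions: Llama-style decoder blocks (RMSNorm + RoPE + SwiGLU),
# no bias terms, and an *untied* LM head -- the total only reaches
# 271M if the output projection is a separate 32000 x 1024 matrix.
vocab, hidden, layers, ffn = 32_000, 1_024, 16, 2_816

embed   = vocab * hidden          # input token embeddings
lm_head = vocab * hidden          # untied output projection
attn    = 4 * hidden * hidden     # q, k, v, o projections
mlp     = 3 * hidden * ffn        # gate, up, down (SwiGLU)
norms   = 2 * hidden              # two RMSNorms per block
block   = attn + mlp + norms

total = embed + lm_head + layers * block + hidden  # + final RMSNorm
print(f"{total / 1e6:.1f}M parameters")            # -> 271.1M
```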
## Measured baseline (Devin audit, CPU eval)
We loaded this checkpoint with `aksarallm.model.aksaraLLMModel`, tested several
candidate tokenizers (`AksaraLLM/aksara-tokenizer-v1/v2/v3`, Llama-2 SentencePiece,
GPT-2 BPE), and measured perplexity on 50 short Indonesian Wikipedia-style sentences
plus 5 free-form prompts. Best-case results across the candidate tokenizers:
- **Perplexity**: BROKEN under every candidate tokenizer (tokenizer mismatch; see Limitations), i.e. the model is **not** modelling the Indonesian token distribution.
- **English-stopword ratio in ID-prompted output**: 0.0%
- **Indonesian-stopword ratio in ID-prompted output**: 0.0% (see the metric sketch after the sample output)
Sample output for prompt "Indonesia adalah negara":
```
Indonesia adalah negaraate companate compan bet cop4 config G betentent somet compan4ident L coll2 Y great75 from less4 Gil differe�ident L fun speech2 Yost immishalhaps4 eas ind we Qu immis
```
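
The stopword ratios above can be approximated with a check like the one below. This is a minimal sketch, not the audit's actual script, and the stopword lists are illustrative rather than exhaustive, so exact numbers may differ slightly.

```python
# Minimal sketch of a stopword-ratio metric: the fraction of whitespace-
# separated tokens in generated text that are common Indonesian (or
# English) stopwords. Word lists here are illustrative, not exhaustive.
ID_STOPWORDS = {"yang", "dan", "di", "dari", "untuk", "dengan",
                "adalah", "ini", "itu", "pada", "ke", "tidak"}
EN_STOPWORDS = {"the", "and", "of", "to", "in", "is",
                "for", "that", "with", "on", "as", "are"}

def stopword_ratio(text: str, stopwords: set) -> float:
    words = [w.strip(".,!?").lower() for w in text.split()]
    if not words:
        return 0.0
    return sum(w in stopwords for w in words) / len(words)

# Truncated continuation from the sample output above (prompt excluded).
generated = "negaraate companate compan bet cop4 config G betentent somet"
print(f"ID-stopword ratio: {stopword_ratio(generated, ID_STOPWORDS):.1%}")
print(f"EN-stopword ratio: {stopword_ratio(generated, EN_STOPWORDS):.1%}")
```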
## Why the original "Skor 10/11 (Grade S)" claim is misleading
The score in earlier versions of this README is from a custom 11-question
in-house scorecard graded on a tiny SFT set, not from a standard language
modelling metric. By any standard LM evaluation (perplexity, response
coherence on out-of-distribution prompts, identity calibration), this model
does not function.
## Limitations
- **Tokenizer not preserved.** Without it, downstream usage is impossible.
- **No HF-compatible `config.json`** in the original training pipeline; the
  `aksarallm` package is required for loading (see the config sketch after this list).
- **Vocab size 32 000** does not match any published AksaraLLM
  tokenizer (32 768, 64 000); it matches the Llama-2 SentencePiece vocabulary in size (32 000),
  but that tokenizer still failed to produce coherent output in the eval above.
- **Trained on a small mixed corpus** (Wikipedia / SFT pairs), insufficient
for general Indonesian generation at this scale.
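
For reference, the weight shapes correspond to a Llama-style layout, so an equivalent HF config would look roughly like the sketch below. This is an assumption for illustration only: writing such a config does not make the checkpoint loadable, because the state-dict key names would still need converting and the matching tokenizer is lost.

```python
from transformers import LlamaConfig

# Hypothetical HF config matching the weight shapes in the table above.
# Illustration only: the checkpoint's tensor names are not in HF format
# and the original tokenizer was not preserved, so this alone does not
# make the model usable with AutoModel / AutoTokenizer.
config = LlamaConfig(
    vocab_size=32_000,
    hidden_size=1_024,
    intermediate_size=2_816,     # SwiGLU FFN size
    num_hidden_layers=16,
    num_attention_heads=16,
    max_position_embeddings=256,
    tie_word_embeddings=False,   # untied LM head (see parameter count)
)
config.save_pretrained("kiel-200m-config")  # writes config.json
```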
## What to use instead
If you want a small Indonesian LM in the AksaraLLM family, use:
- [`AksaraLLM/Kiel-Pro-0.5B-v3`](https://huggingface.co/AksaraLLM/Kiel-Pro-0.5B-v3) — 494M Qwen2-based, perplexity ≈ 15 on the same eval set.
- [`AksaraLLM/AksaraLLM-Qwen-1.5B-v5-public`](https://huggingface.co/AksaraLLM/AksaraLLM-Qwen-1.5B-v5-public) — 1.78B Qwen2-based, perplexity ≈ 8.4.
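
Both models are Qwen2-based, so they should load with the standard `transformers` API. The snippet below is a minimal, untested sketch; the generation arguments are illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load one of the working AksaraLLM models instead of this checkpoint.
model_id = "AksaraLLM/Kiel-Pro-0.5B-v3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Indonesia adalah negara", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```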
## Citation
```
@misc{aksarallm-kiel-200m-matured,
author = {Cahyok Putra and AksaraLLM Community},
title = {Kiel-200M-Matured: early-experiment Indonesian transformer (271M params)},
year = 2025,
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/AksaraLLM/Kiel-200M-Matured}},
}
```
## License
Apache 2.0