Ezekiel999 committed
Commit d823769 · verified · 1 Parent(s): 8e8fb99

[Devin Audit] update model card with measured baseline metrics + honest framing


See https://huggingface.co/datasets/AksaraLLM/audit-2026-04 (or the AUDIT_REPORT.md attached) for methodology and the full per-model eval results.

Files changed (1)
  1. README.md +75 -0
README.md ADDED
---
language:
- id
- en
license: apache-2.0
tags:
- aksarallm
- indonesian
- llama
- from-scratch
- text-generation
pipeline_tag: text-generation
library_name: transformers
---

# aksarallm-1.5b-native

The first **fully from-scratch** AksaraLLM 1.5B model (2.04B actual parameters),
built on a LLaMA-style architecture. Where the `AksaraLLM-Qwen-1.5B*` line is
descended from Qwen2, this checkpoint contains no inherited weights: it was
trained from random initialization on AksaraLLM's own corpus and tokenizer.

## Measured baseline (Devin audit, CPU bf16, 50 short Indonesian sentences)

| Metric | Value |
|---|---|
| Perplexity | **113.5** (much higher than the Qwen-derived models; see below) |
| English-stopword ratio in ID-prompted output | 0.0% |
| Indonesian-stopword ratio in ID-prompted output | **31.3%** (highest of any AksaraLLM model; the most Indonesian-saturated output) |
| Parameters | 2,039.0 M |
| Architecture | `LlamaForCausalLM` |
| Vocabulary size | 151,665 |

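The perplexity figure is a token-level number over the 50-sentence evaluation set; the exact sentences and scoring script are in the linked audit report. The snippet below is a minimal sketch of how a comparable number can be reproduced, assuming standard per-token negative log-likelihood; the two sentences shown are illustrative placeholders, not the audit set.

```python
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Illustrative placeholder sentences -- the audit's 50-sentence set is in the linked report.
sentences = [
    "Ibu kota Indonesia adalah Jakarta.",
    "Nasi goreng adalah makanan yang sangat populer di Indonesia.",
]

tok = AutoTokenizer.from_pretrained("AksaraLLM/aksarallm-1.5b-native")
model = AutoModelForCausalLM.from_pretrained(
    "AksaraLLM/aksarallm-1.5b-native", torch_dtype=torch.bfloat16
)
model.eval()

nll_sum, n_targets = 0.0, 0
with torch.no_grad():
    for text in sentences:
        enc = tok(text, return_tensors="pt")
        # With labels == input_ids the model shifts labels internally and returns
        # the mean cross-entropy over the (length - 1) predicted positions.
        out = model(**enc, labels=enc["input_ids"])
        n = enc["input_ids"].shape[1] - 1
        nll_sum += out.loss.item() * n
        n_targets += n

print("perplexity:", math.exp(nll_sum / n_targets))
```

Perplexity computed this way is sensitive to the sentence set and tokenizer, so compare it only against figures produced by the same protocol, as in the audit table above.
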
## Why the high perplexity?

This model started from random initialization and has been trained on a smaller
corpus than the Qwen2-derived models, which began with roughly 5 T tokens of
pretraining already baked in. A perplexity around 113 indicates a model that is
converging on the Indonesian distribution but is not fully there yet. The very
high Indonesian-stopword ratio (31.3%) and the complete absence of English
leakage suggest the model produces Indonesian-only output even when it is
uncertain: a useful signal that the language identity is trained correctly,
even though lexical and factual quality remain below the Qwen-derived models.

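The stopword ratios referenced above are simple lexical counts over text generated from Indonesian prompts: the fraction of output words that appear in an English or an Indonesian stopword list. The audit's full lists and prompts are in the linked report; the mini-lists below are illustrative assumptions that only show the shape of the metric.

```python
import re

# Illustrative mini-lists; the audit's full stopword lists are in the linked report.
EN_STOPWORDS = {"the", "and", "of", "to", "is", "in", "that", "it", "for", "with"}
ID_STOPWORDS = {"yang", "dan", "di", "ini", "itu", "dengan", "untuk", "dari", "pada", "adalah"}

def stopword_ratio(text, stopwords):
    """Fraction of lowercased word tokens that appear in the stopword set."""
    words = re.findall(r"\w+", text.lower())
    return sum(w in stopwords for w in words) / max(len(words), 1)

generated = "Indonesia adalah negara kepulauan yang terletak di Asia Tenggara."
print("EN-stopword ratio:", stopword_ratio(generated, EN_STOPWORDS))
print("ID-stopword ratio:", stopword_ratio(generated, ID_STOPWORDS))
```
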
This is the **honest from-scratch baseline** for the AksaraLLM project. It is
the right reference point for measuring how much value continued pretraining,
or from-scratch training on a larger corpus, actually delivers (which is
exactly what the planned 20B model aims to demonstrate).

## Loading notes

The checkpoint contains legacy `rope.sin_cached` and `rope.cos_cached` keys
that HF's `LlamaForCausalLM` does not expect; `from_pretrained` silently drops
them on load, which is benign. The checkpoint also has the same
`tie_word_embeddings` config / checkpoint mismatch as the Qwen variants, so we
recommend setting `tie_word_embeddings: false` in `config.json`.

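A minimal sketch of how you might apply the recommended override at load time instead of editing `config.json` by hand, and confirm that only the legacy RoPE buffers are being dropped (`output_loading_info=True` exposes the keys `from_pretrained` skipped):

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "AksaraLLM/aksarallm-1.5b-native"

# Override the mismatched setting at load time rather than editing config.json.
config = AutoConfig.from_pretrained(model_id, tie_word_embeddings=False)

model, loading_info = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    torch_dtype=torch.bfloat16,
    output_loading_info=True,
)

# Expect only the legacy rope.sin_cached / rope.cos_cached entries here.
print("unexpected keys:", loading_info["unexpected_keys"])
print("missing keys:", loading_info["missing_keys"])
```
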
## Quickstart

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tok = AutoTokenizer.from_pretrained("AksaraLLM/aksarallm-1.5b-native")
model = AutoModelForCausalLM.from_pretrained(
    "AksaraLLM/aksarallm-1.5b-native",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Generate a short Indonesian continuation with nucleus sampling.
inp = tok("Indonesia adalah negara", return_tensors="pt").to(model.device)
out = model.generate(**inp, max_new_tokens=120, do_sample=True, top_p=0.9)
print(tok.decode(out[0], skip_special_tokens=True))
```

## License

Apache 2.0