Ezekiel999 committed on
Commit 01be583 · verified · 1 Parent(s): 463126a

Upload 131K BPE tokenizer (vocab=131072, fertility id=1.22 en=1.28)

Files changed (4)
  1. README.md +67 -9
  2. special_tokens_map.json +6 -0
  3. tokenizer.json +0 -0
  4. tokenizer_config.json +11 -18
README.md CHANGED
@@ -1,13 +1,71 @@
  ---
  license: apache-2.0
- language: id
- library_name: tokenizers
- tags: [bpe, indonesian, aksarallm]
  ---
- # Ezekiel999/aksara-tokenizer-20b (smoke test)

- Tiny BPE tokenizer (vocab 4096) uploaded from Devin scaffolding session to
- validate the upload path. **Not for training** — the real 131072-vocab
- tokenizer is produced by `aksara-tokenizer/scripts/train_tokenizer_20b.py`
- on the full Indonesian corpus.
- Special tokens: [BOS] [EOS] [PAD] [UNK] [SYS] [/SYS] [INST] [/INST].
  ---
  license: apache-2.0
+ language:
+ - id
+ - en
+ - ms
+ - jv
+ - su
+ tags:
+ - tokenizer
+ - bpe
+ - aksarallm
+ - indonesian
  ---

+ # AksaraLLM 20B Tokenizer
+
+ Byte-level BPE tokenizer for the [AksaraLLM 20B](https://github.com/cahyohackids/AksaraLLM) pre-training run.
+
+ - **Vocab size**: 131,072
+ - **Algorithm**: Byte-level BPE (GPT-2 / LLaMA-3 style)
+ - **Training corpus**: ~12 GB balanced sample (English web / Indonesian web / Indonesian Wikipedia / Malay / Javanese / Sundanese) from FineWeb, FineWeb-2, CulturaX, and Wikipedia
+ - **Produced by**: `scripts/train_tokenizer_20b.py` in the AksaraLLM repo
+
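Since the vocabulary is byte-level, arbitrary UTF-8 should encode without ever falling back to `<|unk|>`. A minimal sketch of that property (assuming only `transformers` and the repo id from the Usage section below; the sample string is illustrative):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Ezekiel999/aksara-tokenizer-20b")

# Byte-level BPE falls back to raw bytes, so regional text and emoji
# should round-trip losslessly instead of mapping to <|unk|>.
text = "Sugeng rawuh 🙏"
ids = tok(text, add_special_tokens=False).input_ids
assert tok.unk_token_id not in ids
assert tok.decode(ids) == text
```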
+ ## Special tokens (pinned IDs)
+
+ The first 14 IDs are reserved for named special tokens, in this order:
+
+ | ID | Token |
+ |----|-------|
+ | 0 | `<\|pad\|>` |
+ | 1 | `<\|bos\|>` |
+ | 2 | `<\|eos\|>` |
+ | 3 | `<\|unk\|>` |
+ | 4 | `<\|system\|>` |
+ | 5 | `<\|user\|>` |
+ | 6 | `<\|assistant\|>` |
+ | 7 | `<\|tool\|>` |
+ | 8 | `<\|im_start\|>` |
+ | 9 | `<\|im_end\|>` |
+ | 10 | `<\|fim_prefix\|>` |
+ | 11 | `<\|fim_middle\|>` |
+ | 12 | `<\|fim_suffix\|>` |
+ | 13 | `<\|endoftext\|>` |
+
+ The last 256 IDs (130816–131071) are reserved as `<|reserved_N|>` for future expansion without breaking already-pretrained checkpoints.
+
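Because the IDs are pinned, they can be asserted directly after loading. A quick sanity-check sketch (assumes `transformers`; the table above is the source of truth):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Ezekiel999/aksara-tokenizer-20b")

# The first 14 IDs must map to the named special tokens in table order.
pinned = [
    "<|pad|>", "<|bos|>", "<|eos|>", "<|unk|>",
    "<|system|>", "<|user|>", "<|assistant|>", "<|tool|>",
    "<|im_start|>", "<|im_end|>",
    "<|fim_prefix|>", "<|fim_middle|>", "<|fim_suffix|>",
    "<|endoftext|>",
]
for expected_id, token in enumerate(pinned):
    assert tok.convert_tokens_to_ids(token) == expected_id, token
```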
+ ## Fertility (tokens per whitespace-word)
+
+ Measured on ~200 KB held-out samples from each language:
+
+ | Language | Fertility | Target |
+ |----------|-----------|--------|
+ | English web | 1.280 | ≤ 1.40 |
+ | Indonesian wiki | 1.357 | ≤ 1.60 |
+ | Indonesian web (CulturaX) | 1.215 | ≤ 1.60 |
+ | Malay wiki | 1.368 | ≤ 1.60 |
+ | Javanese wiki | 1.657 | ≤ 1.80 |
+
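For reference, fertility as used here is simply token count divided by whitespace-word count. A minimal sketch of the measurement (the file path is a placeholder, not something shipped in this repo):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Ezekiel999/aksara-tokenizer-20b")

def fertility(text: str) -> float:
    """Tokens per whitespace-separated word, excluding special tokens."""
    n_words = len(text.split())
    n_tokens = len(tok(text, add_special_tokens=False).input_ids)
    return n_tokens / n_words

# Placeholder path; the table above used ~200 KB of held-out text per language.
with open("heldout_id_wiki.txt", encoding="utf-8") as f:
    print(f"{fertility(f.read()):.3f}")
```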
+ ## Usage
+
+ ```python
+ from transformers import AutoTokenizer
+ tok = AutoTokenizer.from_pretrained("Ezekiel999/aksara-tokenizer-20b")
+ ids = tok("Halo dunia, saya AksaraLLM.", add_special_tokens=False).input_ids
+ # → 8 tokens
+ ```
+
+ ## License
+
+ Apache-2.0.
special_tokens_map.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "bos_token": "<|bos|>",
+   "eos_token": "<|eos|>",
+   "pad_token": "<|pad|>",
+   "unk_token": "<|unk|>"
+ }
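These four entries are what `transformers` uses to populate the tokenizer's named attributes; a quick check (same assumptions as the README's Usage snippet):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Ezekiel999/aksara-tokenizer-20b")

# Values come from special_tokens_map.json / tokenizer_config.json.
print(tok.bos_token, tok.eos_token, tok.pad_token, tok.unk_token)
# expected: <|bos|> <|eos|> <|pad|> <|unk|>
```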
tokenizer.json CHANGED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json CHANGED
@@ -1,20 +1,13 @@
  {
- "tokenizer_class": "AksaraTokenizer",
- "bos_token": "[BOS]",
- "eos_token": "[EOS]",
- "pad_token": "[PAD]",
- "unk_token": "[UNK]",
- "added_special_tokens": [
-   "[BOS]",
-   "[EOS]",
-   "[PAD]",
-   "[UNK]",
-   "[SYS]",
-   "[/SYS]",
-   "[INST]",
-   "[/INST]"
- ],
- "chat_template": "{% if messages[0]['role'] == 'system' %}[SYS]{{ messages[0]['content'] }}[/SYS]{% set messages = messages[1:] %}{% endif %}{% for message in messages %}{% if message['role'] == 'user' %}[INST]{{ message['content'] }}[/INST]{% elif message['role'] == 'assistant' %}{{ message['content'] }}[EOS]{% endif %}{% endfor %}",
- "default_system_prompt": "Kamu adalah AksaraLLM, asisten AI berbahasa Indonesia yang cerdas, sopan, dan membantu. Jawab dengan jelas, jujur, dan ringkas.",
- "model_max_length": 8192
  }

  {
+ "add_bos_token": false,
+ "add_eos_token": false,
+ "added_tokens_decoder": {},
+ "bos_token": "<|bos|>",
+ "clean_up_tokenization_spaces": false,
+ "eos_token": "<|eos|>",
+ "model_max_length": 131072,
+ "pad_token": "<|pad|>",
+ "tokenizer_class": "PreTrainedTokenizerFast",
+ "unk_token": "<|unk|>",
+ "chat_template": "{% for message in messages %}{{ '<|im_start|>' + message['role'] + '\\n' + message['content'] + '<|im_end|>' + '\\n' }}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\\n' }}{% endif %}"
  }
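The replacement `chat_template` is plain ChatML. A short sketch of what it renders, given the pinned `<|im_start|>`/`<|im_end|>` tokens (assumes `transformers`; the messages are illustrative):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Ezekiel999/aksara-tokenizer-20b")

messages = [
    {"role": "system", "content": "You are AksaraLLM."},
    {"role": "user", "content": "Halo!"},
]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(text)
# <|im_start|>system
# You are AksaraLLM.<|im_end|>
# <|im_start|>user
# Halo!<|im_end|>
# <|im_start|>assistant
```

Note that the old `[INST]`-style template and `default_system_prompt` key are gone; system prompts now travel as ordinary `system` messages.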