Upload 131K BPE tokenizer (vocab=131072; fertility: id 1.22, en 1.28)
Files changed:
- README.md (+67 -9)
- special_tokens_map.json (+6 -0)
- tokenizer.json (+0 -0)
- tokenizer_config.json (+11 -18)
README.md
CHANGED
@@ -1,13 +1,71 @@
 ---
 license: apache-2.0
-language:
-
-
+language:
+- id
+- en
+- ms
+- jv
+- su
+tags:
+- tokenizer
+- bpe
+- aksarallm
+- indonesian
 ---
-# Ezekiel999/aksara-tokenizer-20b (smoke test)
 
-
-
-tokenizer
-
-
+# AksaraLLM 20B Tokenizer
+
+Byte-level BPE tokenizer for the [AksaraLLM 20B](https://github.com/cahyohackids/AksaraLLM) pre-training run.
+
+- **Vocab size**: 131,072
+- **Algorithm**: Byte-level BPE (GPT-2 / LLaMA-3 style)
+- **Training corpus**: ~12 GB balanced sample (English web / Indonesian web / Indonesian Wikipedia / Malay / Javanese / Sundanese) from FineWeb, FineWeb-2, CulturaX, and Wikipedia
+- **Produced by**: `scripts/train_tokenizer_20b.py` in the AksaraLLM repo
+
+## Special tokens (pinned IDs)
+
+The first 14 IDs are reserved for named special tokens, in this order:
+
+| ID | Token |
+|----|-------|
+| 0 | `<\|pad\|>` |
+| 1 | `<\|bos\|>` |
+| 2 | `<\|eos\|>` |
+| 3 | `<\|unk\|>` |
+| 4 | `<\|system\|>` |
+| 5 | `<\|user\|>` |
+| 6 | `<\|assistant\|>` |
+| 7 | `<\|tool\|>` |
+| 8 | `<\|im_start\|>` |
+| 9 | `<\|im_end\|>` |
+| 10 | `<\|fim_prefix\|>` |
+| 11 | `<\|fim_middle\|>` |
+| 12 | `<\|fim_suffix\|>` |
+| 13 | `<\|endoftext\|>` |
+
+The last 256 IDs (130816–131071) are reserved as `<|reserved_N|>` for future expansion without breaking already-pretrained checkpoints.
+
+## Fertility (tokens per whitespace-word)
+
+Measured on ~200 KB held-out samples from each language:
+
+| Language | Fertility | Target |
+|----------|-----------|--------|
+| English web | 1.280 | ≤ 1.40 |
+| Indonesian wiki | 1.357 | ≤ 1.60 |
+| Indonesian web (CulturaX) | 1.215 | ≤ 1.60 |
+| Malay wiki | 1.368 | ≤ 1.60 |
+| Javanese wiki | 1.657 | ≤ 1.80 |
+
+## Usage
+
+```python
+from transformers import AutoTokenizer
+tok = AutoTokenizer.from_pretrained("Ezekiel999/aksara-tokenizer-20b")
+ids = tok("Halo dunia, saya AksaraLLM.", add_special_tokens=False).input_ids
+# → 8 tokens
+```
+
+## License
+
+Apache-2.0.
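The pinned IDs in the new README are easy to sanity-check after loading. A minimal sketch, assuming the repo id from the Usage section; the expected list is copied verbatim from the table above, and the exact names of the reserved-tail placeholders are an assumption (the README only pins the ID range):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Ezekiel999/aksara-tokenizer-20b")

# The 14 named special tokens, in the pinned order from the README table.
pinned = [
    "<|pad|>", "<|bos|>", "<|eos|>", "<|unk|>",
    "<|system|>", "<|user|>", "<|assistant|>", "<|tool|>",
    "<|im_start|>", "<|im_end|>",
    "<|fim_prefix|>", "<|fim_middle|>", "<|fim_suffix|>",
    "<|endoftext|>",
]
for expected_id, token in enumerate(pinned):
    assert tok.convert_tokens_to_ids(token) == expected_id, (token, expected_id)

# Reserved tail: the README pins IDs 130816-131071 as <|reserved_N|>,
# but the exact value of N per ID is an assumption here, so just inspect.
print(tok.convert_ids_to_tokens(130816))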
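The fertility numbers are likewise reproducible from the definition given in the README (tokens per whitespace-separated word). A minimal sketch; `sample.txt` is a hypothetical held-out file, and the actual evaluation script in the AksaraLLM repo may differ:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Ezekiel999/aksara-tokenizer-20b")

# Fertility = tokens per whitespace-separated word, per the README.
with open("sample.txt", encoding="utf-8") as f:  # hypothetical held-out sample
    text = f.read()

n_words = len(text.split())
n_tokens = len(tok(text, add_special_tokens=False).input_ids)
print(f"fertility = {n_tokens / n_words:.3f}")
```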
special_tokens_map.json
ADDED
@@ -0,0 +1,6 @@
+{
+  "bos_token": "<|bos|>",
+  "eos_token": "<|eos|>",
+  "pad_token": "<|pad|>",
+  "unk_token": "<|unk|>"
+}
tokenizer.json
CHANGED
The diff for this file is too large to render. See raw diff.
tokenizer_config.json
CHANGED
@@ -1,20 +1,13 @@
 {
-  "
-  "
-  "
-  "
-  "
-  "
-
-
-
-
-
-    "[/SYS]",
-    "[INST]",
-    "[/INST]"
-  ],
-  "chat_template": "{% if messages[0]['role'] == 'system' %}[SYS]{{ messages[0]['content'] }}[/SYS]{% set messages = messages[1:] %}{% endif %}{% for message in messages %}{% if message['role'] == 'user' %}[INST]{{ message['content'] }}[/INST]{% elif message['role'] == 'assistant' %}{{ message['content'] }}[EOS]{% endif %}{% endfor %}",
-  "default_system_prompt": "Kamu adalah AksaraLLM, asisten AI berbahasa Indonesia yang cerdas, sopan, dan membantu. Jawab dengan jelas, jujur, dan ringkas.",
-  "model_max_length": 8192
+  "add_bos_token": false,
+  "add_eos_token": false,
+  "added_tokens_decoder": {},
+  "bos_token": "<|bos|>",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|eos|>",
+  "model_max_length": 131072,
+  "pad_token": "<|pad|>",
+  "tokenizer_class": "PreTrainedTokenizerFast",
+  "unk_token": "<|unk|>",
+  "chat_template": "{% for message in messages %}{{ '<|im_start|>' + message['role'] + '\\n' + message['content'] + '<|im_end|>' + '\\n' }}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\\n' }}{% endif %}"
 }
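The new `chat_template` is ChatML-style. A minimal sketch of how it renders, assuming the repo id from the README; the expected output follows directly from the template string above:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Ezekiel999/aksara-tokenizer-20b")

messages = [
    {"role": "system", "content": "Kamu adalah AksaraLLM."},
    {"role": "user", "content": "Halo!"},
]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(text)
# <|im_start|>system
# Kamu adalah AksaraLLM.<|im_end|>
# <|im_start|>user
# Halo!<|im_end|>
# <|im_start|>assistant
```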