Upload 131K BPE tokenizer (vocab=131072; fertility: id 1.22, en 1.28)
Files changed:
- README.md (+67 -9)
- special_tokens_map.json (+6 -0)
- tokenizer.json (+0 -0)
- tokenizer_config.json (+11 -18)
README.md
CHANGED
@@ -1,13 +1,71 @@
 ---
 license: apache-2.0
-language:
-
-
+language:
+- id
+- en
+- ms
+- jv
+- su
+tags:
+- tokenizer
+- bpe
+- aksarallm
+- indonesian
 ---
-# Ezekiel999/aksara-tokenizer-20b (smoke test)
 
-
-
-tokenizer
-
-
+# AksaraLLM 20B Tokenizer
+
+Byte-level BPE tokenizer for the [AksaraLLM 20B](https://github.com/cahyohackids/AksaraLLM) pre-training run.
+
+- **Vocab size**: 131,072
+- **Algorithm**: Byte-level BPE (GPT-2 / LLaMA-3 style)
+- **Training corpus**: ~12 GB balanced sample (English web / Indonesian web / Indonesian Wikipedia / Malay / Javanese / Sundanese) from FineWeb, FineWeb-2, CulturaX, and Wikipedia
+- **Produced by**: `scripts/train_tokenizer_20b.py` in the AksaraLLM repo
+
+## Special tokens (pinned IDs)
+
+The first 14 IDs are reserved for named special tokens, in this order:
+
+| ID | Token |
+|----|-------|
+| 0 | `<\|pad\|>` |
+| 1 | `<\|bos\|>` |
+| 2 | `<\|eos\|>` |
+| 3 | `<\|unk\|>` |
+| 4 | `<\|system\|>` |
+| 5 | `<\|user\|>` |
+| 6 | `<\|assistant\|>` |
+| 7 | `<\|tool\|>` |
+| 8 | `<\|im_start\|>` |
+| 9 | `<\|im_end\|>` |
+| 10 | `<\|fim_prefix\|>` |
+| 11 | `<\|fim_middle\|>` |
+| 12 | `<\|fim_suffix\|>` |
+| 13 | `<\|endoftext\|>` |
+
+The last 256 IDs (130816–131071) are reserved as `<|reserved_N|>` for future expansion without breaking already-pretrained checkpoints.
+
+## Fertility (tokens per whitespace-word)
+
+Measured on ~200 KB held-out samples from each language:
+
+| Language | Fertility | Target |
+|----------|-----------|--------|
+| English web | 1.280 | ≤ 1.40 |
+| Indonesian wiki | 1.357 | ≤ 1.60 |
+| Indonesian web (CulturaX) | 1.215 | ≤ 1.60 |
+| Malay wiki | 1.368 | ≤ 1.60 |
+| Javanese wiki | 1.657 | ≤ 1.80 |
+
+## Usage
+
+```python
+from transformers import AutoTokenizer
+tok = AutoTokenizer.from_pretrained("Ezekiel999/aksara-tokenizer-20b")
+ids = tok("Halo dunia, saya AksaraLLM.", add_special_tokens=False).input_ids
+# → 8 tokens
+```
+
+## License
+
+Apache-2.0.
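The pinned IDs in the new README are easy to sanity-check after loading. A minimal sketch, assuming the repo id from the Usage section; the expected list is copied verbatim from the table above, and the exact names of the reserved-tail placeholders are an assumption (the README only pins the ID range):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Ezekiel999/aksara-tokenizer-20b")

# The 14 named special tokens, in the pinned order from the README table.
pinned = [
    "<|pad|>", "<|bos|>", "<|eos|>", "<|unk|>",
    "<|system|>", "<|user|>", "<|assistant|>", "<|tool|>",
    "<|im_start|>", "<|im_end|>",
    "<|fim_prefix|>", "<|fim_middle|>", "<|fim_suffix|>",
    "<|endoftext|>",
]
for expected_id, token in enumerate(pinned):
    assert tok.convert_tokens_to_ids(token) == expected_id, (token, expected_id)

# Reserved tail: the README pins IDs 130816-131071 as <|reserved_N|>,
# but the exact value of N per ID is an assumption here, so just inspect.
print(tok.convert_ids_to_tokens(130816))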
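The fertility numbers are likewise reproducible from the definition given in the README (tokens per whitespace-separated word). A minimal sketch; `sample.txt` is a hypothetical held-out file, and the actual evaluation script in the AksaraLLM repo may differ:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Ezekiel999/aksara-tokenizer-20b")

# Fertility = tokens per whitespace-separated word, per the README.
with open("sample.txt", encoding="utf-8") as f:  # hypothetical held-out sample
    text = f.read()

n_words = len(text.split())
n_tokens = len(tok(text, add_special_tokens=False).input_ids)
print(f"fertility = {n_tokens / n_words:.3f}")
```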
special_tokens_map.json
ADDED
@@ -0,0 +1,6 @@
+{
+  "bos_token": "<|bos|>",
+  "eos_token": "<|eos|>",
+  "pad_token": "<|pad|>",
+  "unk_token": "<|unk|>"
+}
tokenizer.json
CHANGED
The diff for this file is too large to render. See raw diff.
tokenizer_config.json
CHANGED
@@ -1,20 +1,13 @@
 {
-  "
-  "
-  "
-  "
-  "
-  "
-
-
-
-
-
-    "[/SYS]",
-    "[INST]",
-    "[/INST]"
-  ],
-  "chat_template": "{% if messages[0]['role'] == 'system' %}[SYS]{{ messages[0]['content'] }}[/SYS]{% set messages = messages[1:] %}{% endif %}{% for message in messages %}{% if message['role'] == 'user' %}[INST]{{ message['content'] }}[/INST]{% elif message['role'] == 'assistant' %}{{ message['content'] }}[EOS]{% endif %}{% endfor %}",
-  "default_system_prompt": "Kamu adalah AksaraLLM, asisten AI berbahasa Indonesia yang cerdas, sopan, dan membantu. Jawab dengan jelas, jujur, dan ringkas.",
-  "model_max_length": 8192
+  "add_bos_token": false,
+  "add_eos_token": false,
+  "added_tokens_decoder": {},
+  "bos_token": "<|bos|>",
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|eos|>",
+  "model_max_length": 131072,
+  "pad_token": "<|pad|>",
+  "tokenizer_class": "PreTrainedTokenizerFast",
+  "unk_token": "<|unk|>",
+  "chat_template": "{% for message in messages %}{{ '<|im_start|>' + message['role'] + '\\n' + message['content'] + '<|im_end|>' + '\\n' }}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\\n' }}{% endif %}"
 }
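The new `chat_template` is ChatML-style. A minimal sketch of how it renders, assuming the repo id from the README; the expected output follows directly from the template string above:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Ezekiel999/aksara-tokenizer-20b")

messages = [
    {"role": "system", "content": "Kamu adalah AksaraLLM."},
    {"role": "user", "content": "Halo!"},
]
text = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(text)
# <|im_start|>system
# Kamu adalah AksaraLLM.<|im_end|>
# <|im_start|>user
# Halo!<|im_end|>
# <|im_start|>assistant
```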