divilian committed
Commit 16d69c2 · 1 Parent(s): 09854b8

Make transformers (not tokenizers) compatible.

Files changed (3)
  1. README.md +47 -0
  2. special_tokens_map.json +6 -0
  3. tokenizer_config.json +4 -0
README.md ADDED
@@ -0,0 +1,47 @@
# LilChatBot WordLevel Tokenizer

A **WordLevel tokenizer** trained for the *LilChatBot* project.

This tokenizer is designed for clarity, interpretability, and stability rather than maximum compression. It is intended primarily for educational and experimental language-model work.

---

## Design choices

- **WordLevel tokenization** (no subword splitting)
- **Lowercasing**
- **Unicode NFKC normalization**
- **Apostrophes preserved everywhere**
  (e.g. `don't`, `lion's`, `'hello'`, `James'`)
- **Aggressive punctuation isolation**, including:
  - sentence punctuation (`. , ! ? ; :`)
  - brackets (`() [] {}`)
  - slashes (`/`)
  - double quotes (straight and curly)
  - en/em dashes (`– —`)
- **Repeated punctuation collapsed**
  (`!!! → !`, `??? → ?`, `... → .`)
- **English-focused**

This tokenizer intentionally favors **lexical transparency** over vocabulary compactness.
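The pipeline above can be sketched in plain Python. This is an illustrative approximation only — the actual rules live in `tokenizer.json`, and the regexes here are assumptions, not the tokenizer's real normalizer:

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Approximate the README's normalization rules (illustrative only)."""
    text = unicodedata.normalize("NFKC", text).lower()  # NFKC + lowercasing
    # Collapse runs of repeated sentence punctuation: "!!!" -> "!", "..." -> "."
    text = re.sub(r"([.!?])\1+", r"\1", text)
    # Isolate punctuation (but NOT apostrophes) by padding it with spaces
    text = re.sub(r'([.,!?;:()\[\]{}/"“”–—])', r" \1 ", text)
    return re.sub(r"\s+", " ", text).strip()

def word_tokenize(text: str, vocab: set[str]) -> list[str]:
    """WordLevel lookup: whole words only, unknowns map to <unk>."""
    return [w if w in vocab else "<unk>" for w in normalize(text).split()]

vocab = {"don't", "forget", "that", "!"}
print(word_tokenize("Don't forget THAT!!!", vocab))
# ["don't", 'forget', 'that', '!']
```

Note how the apostrophe in `don't` survives intact while the repeated `!!!` collapses to a single isolated token.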
---

## Files

- `tokenizer.json` — complete tokenizer definition (normalizer, pre-tokenizer, vocab, special tokens)

The tokenizer can be used directly via the `tokenizers` library or wrapped for use with `transformers`.
35
+
36
+ ---
37
+
38
+ ## Usage
39
+
40
+ ### With `transformers`
41
+
42
+ ```python
43
+ from transformers import AutoTokenizer
44
+
45
+ tok = AutoTokenizer.from_pretrained("divilian/lilchatbot-tokenizer")
46
+
47
+ print(tok.encode("The lion’s well-being matters — don't forget that!").tokens)
special_tokens_map.json ADDED
@@ -0,0 +1,6 @@
{
  "unk_token": "<unk>",
  "pad_token": "<pad>",
  "bos_token": "<bos>",
  "eos_token": "<eos>"
}
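How these four tokens typically divide the work: `<unk>` absorbs out-of-vocabulary words during lookup, while `<bos>`/`<eos>` mark sequence boundaries. A minimal sketch of the boundary framing (the helper name is hypothetical, not an API from this repo):

```python
def add_special_tokens(tokens: list[str]) -> list[str]:
    """Frame a token sequence with <bos>/<eos> boundary markers."""
    return ["<bos>"] + tokens + ["<eos>"]

print(add_special_tokens(["the", "lion's", "roar"]))
# ['<bos>', 'the', "lion's", 'roar', '<eos>']
```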
tokenizer_config.json ADDED
@@ -0,0 +1,4 @@
{
  "tokenizer_class": "PreTrainedTokenizerFast",
  "model_max_length": 128
}
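`model_max_length: 128` caps inputs at 128 tokens. A minimal sketch of the usual pad/truncate behavior that cap implies (illustrative only — not the `transformers` internals; the helper name is hypothetical):

```python
def pad_or_truncate(tokens: list[str], max_len: int = 128, pad: str = "<pad>") -> list[str]:
    """Truncate to max_len, then right-pad with <pad> up to exactly max_len."""
    tokens = tokens[:max_len]
    return tokens + [pad] * (max_len - len(tokens))

print(pad_or_truncate(["<bos>", "hi", "<eos>"], max_len=5))
# ['<bos>', 'hi', '<eos>', '<pad>', '<pad>']
```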