Make transformers (not tokenizers) compatible.
- README.md +47 -0
- special_tokens_map.json +6 -0
- tokenizer_config.json +4 -0
README.md
ADDED
@@ -0,0 +1,47 @@
# LilChatBot WordLevel Tokenizer

A **WordLevel tokenizer** trained for the *LilChatBot* project.

This tokenizer is designed for clarity, interpretability, and stability rather than maximum compression. It is intended primarily for educational and experimental language-model work.

---

## Design choices

- **WordLevel tokenization** (no subword splitting)
- **Lowercasing**
- **Unicode NFKC normalization**
- **Apostrophes preserved everywhere**
  (e.g. `don't`, `lion's`, `'hello'`, `James'`)
- **Aggressive punctuation isolation**, including:
  - sentence punctuation (`. , ! ? ; :`)
  - brackets (`() [] {}`)
  - slashes (`/`)
  - double quotes (straight and curly)
  - en/em dashes (`– —`)
- **Repeated punctuation collapsed**
  (`!!! → !`, `??? → ?`, `... → .`)
- English-focused

This tokenizer intentionally favors **lexical transparency** over vocabulary compactness.
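
As an illustration of these choices, here is a minimal sketch of how such a pipeline could be assembled with the `tokenizers` library. The regexes, the punctuation character class, and the training file name are assumptions made for the example; the shipped `tokenizer.json` is the source of truth.

```python
from tokenizers import Tokenizer, Regex, models, normalizers, pre_tokenizers, trainers

tok = Tokenizer(models.WordLevel(unk_token="<unk>"))

# Normalize first: NFKC, lowercase, then collapse runs of repeated punctuation.
tok.normalizer = normalizers.Sequence([
    normalizers.NFKC(),
    normalizers.Lowercase(),
    normalizers.Replace(Regex(r"!{2,}"), "!"),
    normalizers.Replace(Regex(r"\?{2,}"), "?"),
    normalizers.Replace(Regex(r"\.{2,}"), "."),
])

# Isolate punctuation (but not apostrophes), then split on whitespace.
tok.pre_tokenizer = pre_tokenizers.Sequence([
    pre_tokenizers.Split(Regex(r'[.,!?;:()\[\]{}/"“”–—]'), behavior="isolated"),
    pre_tokenizers.WhitespaceSplit(),
])

trainer = trainers.WordLevelTrainer(special_tokens=["<unk>", "<pad>", "<bos>", "<eos>"])
tok.train(["corpus.txt"], trainer)  # "corpus.txt" is a hypothetical training corpus
```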

---

## Files

- `tokenizer.json` — complete tokenizer definition (normalizer, pre-tokenizer, vocab, special tokens)
- `special_tokens_map.json` — special-token names (`<unk>`, `<pad>`, `<bos>`, `<eos>`) for `transformers`
- `tokenizer_config.json` — loading configuration for `transformers` (`PreTrainedTokenizerFast`, `model_max_length` of 128)

The tokenizer can be used directly via the `tokenizers` library or wrapped for use with `transformers`.
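
For the direct route, a minimal sketch (the sample sentence is arbitrary, and `hf_hub_download` is used here only to fetch `tokenizer.json` locally):

```python
from huggingface_hub import hf_hub_download
from tokenizers import Tokenizer

path = hf_hub_download("divilian/lilchatbot-tokenizer", "tokenizer.json")
tok = Tokenizer.from_file(path)
print(tok.encode("The lion's share, obviously!").tokens)  # Encoding.tokens holds the token strings
```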

---

## Usage

### With `transformers`

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("divilian/lilchatbot-tokenizer")

print(tok.tokenize("The lion’s well-being matters — don't forget that!"))
```
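
Given the rules above, this should come back lowercased, with the dash and `!` isolated as their own tokens and `don't` kept as a single token.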
special_tokens_map.json
ADDED
@@ -0,0 +1,6 @@
{
  "unk_token": "<unk>",
  "pad_token": "<pad>",
  "bos_token": "<bos>",
  "eos_token": "<eos>"
}
tokenizer_config.json
ADDED
@@ -0,0 +1,4 @@
{
  "tokenizer_class": "PreTrainedTokenizerFast",
  "model_max_length": 128
}
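
These two files are what `transformers` needs on top of `tokenizer.json`: `special_tokens_map.json` supplies the special-token names, and `tokenizer_config.json` selects the tokenizer class and length limit. A quick check, using the same repo id as the README:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("divilian/lilchatbot-tokenizer")
print(tok.unk_token, tok.pad_token, tok.bos_token, tok.eos_token)  # from special_tokens_map.json
print(tok.model_max_length)                                        # 128, from tokenizer_config.json
```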