# LilChatBot WordLevel Tokenizer

A **WordLevel tokenizer** trained for the *LilChatBot* project. This tokenizer is designed for clarity, interpretability, and stability rather than maximum compression. It is intended primarily for educational and experimental language-model work.

---

## Design choices

- **WordLevel tokenization** (no subword splitting)
- **Lowercasing**
- **Unicode NFKC normalization**
- **Apostrophes preserved everywhere** (e.g. `don't`, `lion's`, `'hello'`, `James'`)
- **Aggressive punctuation isolation**, including:
  - sentence punctuation (`. , ! ? ; :`)
  - brackets (`() [] {}`)
  - slashes (`/`)
  - double quotes (straight and curly)
  - en/em dashes (`– —`)
- **Repeated punctuation collapsed** (`!!! → !`, `??? → ?`, `... → .`)
- **English-focused**

This tokenizer intentionally favors **lexical transparency** over vocabulary compactness.

---

## Files

- `tokenizer.json` — complete tokenizer definition (normalizer, pre-tokenizer, vocab, special tokens)

The tokenizer can be used directly via the `tokenizers` library or wrapped for use with `transformers`.

---

## Usage

### With `transformers`

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("divilian/lilchatbot-tokenizer")
print(tok.decode(tok("The lion's well-being matters — don’t forget that!").input_ids))
```
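
The normalization and pre-tokenization rules listed under "Design choices" can be approximated in plain Python. The sketch below is *not* the tokenizer itself (the authoritative definition lives in `tokenizer.json`); it is an illustrative, standard-library-only approximation of the described behavior, useful for reasoning about what token boundaries to expect.

```python
import re
import unicodedata

def pretokenize(text: str) -> list[str]:
    """Illustrative approximation of the rules above -- a sketch, not
    the actual tokenizer.json pipeline."""
    # Unicode NFKC normalization, then lowercasing
    text = unicodedata.normalize("NFKC", text).lower()
    # Collapse repeated sentence punctuation: "!!!" -> "!", "..." -> "."
    text = re.sub(r"([.!?])\1+", r"\1", text)
    # Isolate punctuation: sentence marks, brackets, slashes, double
    # quotes (straight and curly), en/em dashes. Apostrophes are
    # deliberately left untouched, matching the design choices above.
    text = re.sub(r'([.,!?;:()\[\]{}/"\u201c\u201d\u2013\u2014])', r" \1 ", text)
    return text.split()

print(pretokenize("Don't forget!!!"))  # ["don't", 'forget', '!']
```

Because there is no subword splitting, each element of the returned list corresponds directly to a vocabulary lookup, which is what gives the tokenizer its lexical transparency.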